Systems on silicon show a continuous increase in complexity due to the ever
increasing need for implementing new features and improvements of existing
functions. This is enabled by the increasing density with which components can be
integrated on an integrated circuit. At the same time the clock speed at which circuits
are operated tends to increase too. The higher clock speed in combination with the
increased density of components has reduced the area which can operate
synchronously within the same clock domain. This has created the need for a modular
approach. According to such an approach the processing system comprises a plurality
of relatively independent, complex modules. In conventional processing systems the
system's modules usually communicate with each other via a bus. As the number of
modules increases, however, this way of communication is no longer practical for the
following reasons. On the one hand the large number of modules places too high a
load on the bus. On the other hand the bus forms a communication bottleneck as it
enables only one device at a time to send data to the bus. A communication network
forms an effective way to overcome these disadvantages. The communication network
comprises a plurality of partly connected nodes. Messages from a module are
redirected by the nodes to one or more other nodes. To that end the message
comprises first information indicative of the location of the addressed module(s)
within the network. The message may further include second information indicative
of a particular location within the module, such as a memory or register address.
The second information may invoke a particular response of the addressed module.

It is an object of the invention to provide an integrated circuit and a method 
according to the introductory paragraph, which provides the modules therein a 
relatively simple way of issuing messages. 

In order to achieve said object the integrated circuit is characterized by the 
characterizing portion of claim 1. 

In the integrated circuit according to the invention modules can issue messages in a
simple way, by using a single address. This makes it possible for a module to perform
a write action to a particular memory address without being aware of the destination
at which said address is located.

In this way the network appears to the module issuing the message as a bus. This
makes it relatively simple to incorporate already existing modules designed for a
bus-like architecture in an integrated circuit according to the invention.

As such, processing systems are known, where a processor is coupled via a bus to 
various memories, which each are mapped onto a respective portion of the total 
address range. By way of example a ROM and a RAM may be mapped to a first and a
second address range respectively. When the processor performs a read instruction, 
the address in the instruction defines at the same time which memory is selected to 
read the data from. 
In such known processing systems each of the various modules, such as memories, is
directly coupled to the bus. In the integrated circuit according to the invention
selecting one of the modules implies that the one or other memories are set in a state
wherein they do not interfere with the bus traffic. Apart from the memory that is
addressed no other module is required to perform an action (in fact, they don't have to
and don't need to know that another module is active, i.e. they don't have to be 'set
in a state'), or 2) that multiple concurrent and/or pipelined messages can be active
simultaneously in the network as a whole. In an integrated circuit according to the
invention however, information issued by the active module is transferred as a
message via one or more nodes of the network. As a consequence it follows a
different route through the network depending on the address. This route is scheduled
by the network.



Examples of the two pieces of information that are arranged as a single address are:

1) Single logical memory space/map/range mapped to multiple distributed memories,
each with their own physical memory ranges.

2) Virtual memory space mapped to a single logical memory space (distributed or not).

3) Multiple memory spaces/maps/ranges mapped to multiple distributed memories.

For 2) and 3) two translations may take place (vm -> logical -> physical, and
multiple -> single -> physical).
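
The first example (one logical range spread over several memories) can be sketched as follows. This is only an illustration under stated assumptions, not the claimed implementation: the address map, the module names "rom" and "ram", and the function `decode` are hypothetical.

```python
from typing import NamedTuple, Tuple


class Region(NamedTuple):
    base: int    # first logical address of the region
    size: int    # number of addresses in the region
    module: str  # hypothetical name of the destination module


# Hypothetical address map: one logical range spread over two memories.
ADDRESS_MAP = [
    Region(base=0x0000, size=0x1000, module="rom"),
    Region(base=0x1000, size=0x3000, module="ram"),
]


def decode(address: int) -> Tuple[str, int]:
    """Split a single logical address into (destination module, local address)."""
    for region in ADDRESS_MAP:
        if region.base <= address < region.base + region.size:
            return region.module, address - region.base
    raise ValueError(f"address {address:#x} is not mapped")
```

The issuing module only supplies the single address; the first element of the result would select the route through the network and the second the location inside the addressed module.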

The integrated circuit of claim 3 and the method of claim 4 provide another way of 
improving data transfer in an integrated circuit comprising a plurality of modules 
connected by a network. 

Theoretically a transaction could comprise any number of outgoing and/or return
messages. In practice however a transaction is made up of one or two outgoing
messages (from the first to the second module), and zero, one, or two return messages 
(from the second to the first module). By managing the outgoing messages in a way 
different from the return messages the overall efficiency of the network and therewith 
the integrated circuit comprising the network is improved. This is further illustrated 
with the following embodiments. 

With reference to claim 5 it is remarked that guaranteed-throughput (GT) connections
can overbook resources in some cases. For example, when an ANIP opens a GT read
connection, it must reserve slots for the read command messages, and for the read
data messages. The ratio between the two can be very large (e.g., 1:100), which leads
either to large slot tables, or to bandwidth being wasted for the read command
messages. In order to prevent as much as possible that a reservation for guaranteed
traffic would impede other transactions, the bandwidth which can be reserved should
be restricted. On the other hand the best-effort traffic may use any resources which
are currently available. As a consequence guaranteed traffic has a bounded but on
average higher latency than best-effort traffic, which has no fixed upper bound, but
is (or should be) faster on average.

Based on this recognition it has been found that the overall quality of the network
transport could be improved by using best-effort (BE) packets for read command
messages, and GT packets for read data messages. No guarantees can be offered in
this case, but the overall throughput can be higher and more stable than in the case of
using only BE packets.
With reference to claim 6 it is remarked that preferably the outgoing transactions are
handled in a locally ordered and the return transactions in a globally ordered
transaction mode. The one or more addressed modules process the transactions in the
order in which they have been issued, and the return parts of the transactions are all
delivered to the first module in the order in which it initiated the transactions. Even
if ordered channels are used, the responses from different addressed modules (e.g., in
a narrowcast connection) must be sorted at the first module. This kind of ordering
conforms with AMBA.
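
The sorting of responses at the first module can be pictured as a small reorder buffer. The sketch below is a minimal illustration, assuming that each transaction carries a sequence number assigned when the master initiates it; the class `GlobalReorderBuffer` and its method names are hypothetical and do not appear in the description.

```python
class GlobalReorderBuffer:
    """Releases responses in the order the master issued the transactions."""

    def __init__(self) -> None:
        self.next_seq = 0  # sequence number of the next response to release
        self.pending = {}  # seq -> response that arrived ahead of its turn

    def receive(self, seq: int, response: str) -> list:
        """Accept one response; return all responses that can now be delivered."""
        self.pending[seq] = response
        released = []
        while self.next_seq in self.pending:
            released.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return released
```

With local ordering only, each slave's responses already arrive in order over an ordered channel and no such buffer is needed per slave; the buffer is what global ordering of the return part adds.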

To implement global ordering, transactions that are delivered to different second
modules (also referred to as slaves) must be ordered exactly as they were sent by the
first module (also referred to as master). This means that the network should
have a global time indicator and use, e.g., deadline-based scheduling in the network,
while in addition assumptions on the consumption time of the second modules must be
available. An alternative way to introduce global ordering is to introduce explicit
dependencies between transactions. The latter can be done by using
acknowledged/tagged transactions, where proof of delivery to the slave is sent back to
the master using an acknowledgement message. This solution, however, introduces
extra latency because transactions are sequentialised with a round-trip delay/latency
per transaction (send a message, wait for the acknowledgement, send the next message,
wait for the next acknowledgement, etc.). By requiring only a local ordering for the
delivery of the outgoing transactions, the slaves, provided that they are autonomous
(which is usually the case), can execute messages independently.

With reference to claim 7 it is remarked that in this way buffer space is used in
an efficient way. A particular example is an embodiment wherein a large buffer
space is reserved for the buffer of the network interface coupled to an active
module, such as a module issuing a read command, and a small buffer space is
reserved for the buffer of the network interface coupled to a passive module, e.g.
the one receiving the read message.

In other situations there may be different types of flow control (e.g. you never want
to lose write commands, but don't mind losing read data). If a module can do both
read and write commands, it may be important that write transactions always succeed
(e.g. when writing to an interrupt controller), but that read transactions are not critical
because they can be retried (so the CMD of the read transaction is dropped and the
read never executed, or the RETDATA is dropped after the read has been executed).
Another example is that if you know that writes always succeed if they are delivered,
a flow-controlled connection is requested. Acknowledgements are not necessary in
that case. Without flow control acknowledgements are compulsory, complicating the
master and causing additional traffic.

In the integrated circuit according to the invention the decision whether or not to
drop messages is not taken per transaction but for the outgoing and return parts of the
connection as a whole. For example all outgoing messages (having the format
read+address or write+address+data) may be guaranteed lossless, while for all return
messages (whether read data or write acknowledgements) packets may be dropped.



A connection could be opened as follows: 
connid = open (
    outgoing unordered/local/global,
    outgoing buffer size,
    return unordered/local/global,
    return buffer size);

i.e. all outgoing messages have certain properties, and all return messages have
certain properties. 
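
The call above can be made concrete as follows. This is a hedged sketch only: the names `Ordering`, `Connection` and `open_connection` are hypothetical, and the description does not prescribe this interface; it merely shows that the ordering mode and buffer size are fixed once per direction rather than per transaction.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import count


class Ordering(Enum):
    UNORDERED = "unordered"
    LOCAL = "local"
    GLOBAL = "global"


@dataclass(frozen=True)
class Connection:
    connid: int
    outgoing_ordering: Ordering
    outgoing_buffer_size: int
    return_ordering: Ordering
    return_buffer_size: int


_next_connid = count()  # hypothetical source of connection identifiers


def open_connection(outgoing_ordering: Ordering, outgoing_buffer_size: int,
                    return_ordering: Ordering, return_buffer_size: int) -> Connection:
    """Open a connection whose outgoing and return parts have separate properties."""
    if outgoing_buffer_size < 0 or return_buffer_size < 0:
        raise ValueError("buffer sizes must be non-negative")
    return Connection(next(_next_connid), outgoing_ordering, outgoing_buffer_size,
                      return_ordering, return_buffer_size)
```

A read connection in the spirit of claims 6 and 7 might then be opened as `open_connection(Ordering.LOCAL, 2, Ordering.GLOBAL, 16)`, i.e. a small, locally ordered outgoing part and a larger, globally ordered return part.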

With reference to claim 8 it is remarked that in a processing system with modules
working asynchronously with respect to each other it is usual that a module receiving
data issues an acknowledge signal to inform the issuing processor that it has received
a message. In case a message is multicast, a plurality of said acknowledge signals
is generated, which imposes a burden on the issuing processor. In the integrated
circuit of the invention the first module receives only a single message, which reduces
this burden. This measure is based on the insight that the network usually can
relatively easily generate the single return message in response to the plurality of
acknowledge messages of the second modules as a side effect of the functions already
present in the network for other purposes.

With reference to claim 9: Depending on the situation the single return message can 
depend on the acknowledge messages in various ways. The embodiment of claim 2 is
favorable where the addressed second modules are memories, and the first module 
attempts to store data therein. In that case it is sufficient that only one copy of the data 
is really received and stored. 

With reference to claim 10: In other situations it is compulsory that each of the
addressed second modules has received the data. In the embodiment of claim 10 the
single return message is not generated until this is the case.
Otherwise the return message could be combined as follows.
If the write transaction has been successfully executed by all slaves, all will
return RETSTAT=RETOK, which can be combined by
the ANIP into a single message to be delivered to the master.

If the write transaction has been successfully executed only by some slaves, there
will be a mix of RETSTATs (RETOK and RETERROR). They can either be
combined into

(a) a single RETSTAT=RETERROR, to specify that an error occurred, or

(b) a single RETSTAT, but a larger, more descriptive one, encoding
where there have been errors. All RETSTATs can be bundled together
in a single RETSTAT for the master, or <slave identifier, error code>
pairs can be bundled to form a single RETSTAT for the master.

If the connection has no flow control, messages can be dropped
at the PNIPs, resulting also in RETSTAT=RETLOST messages. Again, combinations
as those above can be made.
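
Options (a) and (b) can be sketched in a few lines. This is an illustrative assumption, not the ANIP's actual logic: the function `combine_retstats` is hypothetical, and the choice to report every non-RETOK status as a summary RETERROR is made here only for the sake of the example.

```python
from typing import Dict, Tuple


def combine_retstats(retstats: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Combine per-slave RETSTATs into one message for the master.

    Returns (summary, details): the summary alone corresponds to option (a),
    the details are the <slave identifier, error code> pairs of option (b).
    """
    failures = {slave: status for slave, status in retstats.items()
                if status != "RETOK"}
    if not failures:
        return "RETOK", {}
    return "RETERROR", failures
```

The master thus receives a single return message regardless of how many slaves were addressed; whether it carries only the summary or also the pairs is a trade-off between message size and diagnostic detail.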
With reference to claim 11: In this way it is guaranteed that the first module always
receives a response to a transaction, even if the connection has no flow control (i.e.
data may be dropped). This is done by only dropping data in the PNIP (the network
interface coupled to the second, receiving module), and returning a FAIL/ERROR to
the ANIP (the network interface coupled to the first module). This return status
(RETSTAT) message will never be dropped because the ANIP that initiated the
transaction will reserve space for return messages of every transaction that it initiates.
This combination of reserving space and generating an error message whenever a
message is dropped is a way to introduce flow control. Preferably the RETSTAT
message is generated by the interface of the receiving module, although alternatively
it could be generated at the intermediary network nodes too.
The method according to the invention guarantees transaction completion, i.e. it is
always known whether an initiated transaction

(a) was delivered and executed successfully at the slave (RETSTAT=RETOK produced by
the slave), or

(b) was never delivered at the slave (RETSTAT=REQLOST produced by the PNIP),
or

(c) was delivered at the slave, but not successfully executed (RETSTAT=RETERROR
produced by the slave), or

(d) was delivered and executed successfully at the slave but the response message was
dropped (RETSTAT=RETLOST produced by the ANIP).

This is achieved by either

(i) not dropping messages (flow-controlled connection), in which case RETSTAT is
either RETOK or RETERROR, or

(ii) allowing messages to be dropped (on a connection without flow control), but
generating a RETSTAT (REQLOST or RETLOST) whenever the message is dropped,
or a RETOK or RETERROR as usual when the message is not dropped.

It is essential, however, never to drop RETSTATs, because the RETSTAT completes the
transaction. This is realized in that a buffer for the RETSTAT is located at the master's
ANIP. The latter reserves space for RETSTATs when initiating transactions, and
bounds the number of outstanding transactions (for finite sized RETSTAT buffers).
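
The reservation rule can be sketched as a simple credit counter. This is a minimal illustration under assumptions: the class `Anip` and its methods are hypothetical, and a real network interface would track the reservation per connection in hardware.

```python
class Anip:
    """Reserves one RETSTAT buffer slot per outstanding transaction."""

    def __init__(self, retstat_slots: int) -> None:
        self.capacity = retstat_slots
        self.free_slots = retstat_slots

    def try_initiate(self) -> bool:
        """Reserve a slot; refuse to initiate the transaction if none is free."""
        if self.free_slots == 0:
            return False
        self.free_slots -= 1
        return True

    def complete(self) -> None:
        """Release a slot once the master has consumed the RETSTAT."""
        if self.free_slots == self.capacity:
            raise RuntimeError("no outstanding transaction to complete")
        self.free_slots += 1
```

Because a slot is reserved before the request leaves the ANIP, an arriving RETSTAT always finds room, which is why it never needs to be dropped; the price is that the number of outstanding transactions is bounded by the buffer size.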

The flow control on the outgoing and return connections is in principle independent.
Thus, for outgoing flow control & return flow control, the RETSTAT message is
according to a) or c) above.

In case of outgoing flow control & no return flow control, the RETSTAT message is
a) or c) or d) above.

In case of no outgoing flow control & return flow control, the RETSTAT message is
a) or b) or c) above.
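
The three combinations listed above follow one rule, which the sketch below makes explicit. The function name `possible_retstats` is hypothetical. Note that the fourth combination (no flow control in either direction) is not spelled out in the text; the result for it is an inference from the same rule.

```python
def possible_retstats(outgoing_fc: bool, return_fc: bool) -> set:
    """Return the RETSTAT values a master may observe on a connection."""
    outcomes = {"RETOK", "RETERROR"}  # cases (a) and (c) are always possible
    if not outgoing_fc:
        outcomes.add("REQLOST")       # case (b): request dropped at the PNIP
    if not return_fc:
        outcomes.add("RETLOST")       # case (d): response dropped at the ANIP
    return outcomes
```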

Other embodiments are such an integrated circuit wherein the return message is a 
message indicating whether the second module has received a message from the first 
module. In this embodiment the return message can be very compact, e.g. one or two
bits to indicate one of the four options described above. 
Alternatively or in addition a return message comprises an identification of the 
message received by the second module. 

Page: 7 

1. I suggest "efficiency" instead of "performance", because performance is just one of the
factors. We may have the option to reduce the cost of the network (e.g., reduce buffer sizes), or
increase the performance (e.g., by adding more connections for the same resources).
Page: 3 

2. This is an example of the use of different properties for outgoing and return parts. However,
more can be defined:
- Acknowledged write transaction: write command + outgoing data use guaranteed throughput
(mode one in your example), and acknowledgment uses best effort (mode two in your example).
Moreover, apart from time-related guarantees, there is also a distinction in the buffering in both your and
the above example. For data messages there is potentially more buffering allocated than for commands
and acknowledgments. Consequently, for a read transaction (your example) buffers for the return part
would be larger than those for the outgoing part. For the acknowledged write (the example above),
buffers for the outgoing part are larger, and those for acknowledgments are smaller.
Page: 8

3. It is indeed possible to allocate different bandwidths as you suggest. However, there are also
limitations. We use a slot table, which contains a number of slots in a time window. Bandwidth is
reserved by allocating these slots to connections. For example, if we use a table with 100 slots for a time
frame of 1 µs, each slot will be allocated for 1/100 of 1 µs = 10 ns. If the network provides 1 Gb/s per
link, the bandwidth per slot will be 1/100 of 1 Gb/s = 10 Mb/s. We can only allocate multiples of
10 Mb/s for guaranteed throughput traffic.

For a read command generating long bursts, allocating the minimum bandwidth of 10 Mb/s would
probably be too much, as it will use only a small fraction of it. The bandwidth can indeed be used by
best-effort traffic, however, not by other guaranteed throughput traffic. As a result, not all the traffic for
which guarantees are needed may fit in the slot table.

An alternative is to use more slots, but this increases the cost of the router. This is why a best effort
command may be a better solution.
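
The slot arithmetic in this comment can be checked with a short calculation. The sketch below is only an illustration: the function names are hypothetical, and the figures (100 slots, a 1 µs frame, a 1 Gb/s link, a 1:100 command-to-data ratio) are the ones used in the text.

```python
import math


def slot_parameters(num_slots: int, frame_ns: int, link_mbps: int) -> tuple:
    """Return (slot duration in ns, bandwidth per slot in Mb/s)."""
    return frame_ns / num_slots, link_mbps / num_slots


def slots_needed(required_mbps: float, num_slots: int, link_mbps: int) -> int:
    """Slots a connection must reserve; bandwidth comes in whole slots only."""
    per_slot_mbps = link_mbps / num_slots
    return math.ceil(required_mbps / per_slot_mbps)


# 100 slots in a 1 us (1000 ns) frame on a 1 Gb/s (1000 Mb/s) link.
duration_ns, per_slot_mbps = slot_parameters(100, 1000, 1000)

# Read data at 10 Mb/s with a 1:100 command-to-data ratio: the commands need
# only 0.1 Mb/s, yet they still occupy one full 10 Mb/s slot.
command_slots = slots_needed(0.1, 100, 1000)
```

Under these figures each slot lasts 10 ns and carries 10 Mb/s, so a guaranteed-throughput read command stream that needs 0.1 Mb/s leaves 99% of its reserved slot unused by guaranteed traffic.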
Page: 8 

4. This definition is good for outgoing messages, as there is one source (ANIP) and potentially
multiple destinations (PNIPs). However, for return messages, we define global/local ordering as
follows. Global ordering means that responses from all PNIPs/slaves (i.e. sources of messages in this
case) come in the same order as the transactions have been initiated (i.e. the same order as the
commands have been issued by the master to the ANIP). Local ordering guarantees the order of
responses only if they come from the same slave/PNIP.

Page: 8

5. Slave modules

Page: 8 

6. We can only guarantee the order in which we offer transactions to the slave module, but the order of
processing depends on the module implementation. It can well decide to process transactions in a
different order (e.g., memory controller). For ordering we only require that the responses are returned in
the same order as the transactions were accepted.
Page: 8

7. This is only valid for global ordering. For local ordering (i.e., order preserved only per slave),
if ordered transport channels are used, no sorting is necessary.

Page: 8 

8. Global ordering of responses conforms with AMBA. Local ordering of responses does not.
Page: 8 

9. I think Kees meant write transactions may be critical and we don't want to lose them, but
read transactions can be lost, because they can be tried later. See example below in the text.
Page: 8 

10. The two commands (i.e., read and write) can indeed be sent from the same module. If we set
up a connection with flow control for the outgoing part both commands will be delivered. However, if
the return part has no flow control, the responses for read commands may be lost. In such a case, the
read transactions will fail. I think Kees meant read transactions being lost, not read commands being
lost.

Page: 8 

11. fc = flow control, nofc = no flow control 
Page: 8 

12. Buffer is reserved only for a return status message, such as an acknowledgment, or an error
message. Buffer can be, but is not necessarily, reserved also for returned data.
Page: 8

13. Data can also be dropped at the ANIP (i.e., RETDATA) when no flow control is implemented
for the return part. In such a case, a RETSTAT=RETLOST will replace the RETSTAT=RETOK
which accompanied the dropped RETDATA.

Page: 8 

14. Has reserved 
Page: 8 

15. Yes, this is true. Between routers, there is always link level flow control and data is never
lost. Data can be lost only in the network interfaces, if no end-to-end flow control (here referred
to simply as flow control) is implemented. Therefore, here, messages reach the PNIP even when no
(end-to-end) flow control is implemented.



These and other aspects are described in more detail in the following three annexes.

1. Communication Services for Networks on Chip, pages 1-25, by Andrei
Radulescu and Kees Goossens;

Further background information useful for implementing the invention can be found
at:

2. Networks on Silicon: Blessing or Nightmare?, pp 1-5, by Paul Wielage and
Kees Goossens (published), and

3. Trade-Offs in the Design of a Router with Combined Guaranteed and Best-
Effort Services for Networks on Chip, pp 1-6, by Edwin Rijpkema, Kees Goossens,
Andrei Radulescu, Jef van Meerbergen, and Paul Wielage, submitted to and rejected
by ISSS 2002.
I. Introduction



Networks on chip (NoC) have received attention recently as an interconnect
for highly-complex chips [3-5, 7-9, 15, 19, 22]. The reason is twofold. First,
NoCs help resolve the electrical problems in new deep-submicron technologies,
as they structure and manage global wires [3-5, 7, 8]. At the same time they
share wires, lowering their number and increasing their utilization [7, 8].
NoCs can also be energy efficient and reliable [4], and are scalable compared
to buses [9]. Second, NoCs also decouple computation from communication, which
is essential in managing the design of billion-transistor chips [14, 22]. NoCs
achieve this decoupling because they are traditionally designed using protocol
stacks [21], which provide well-defined interfaces separating communication
service usage from service implementation [5, 22].
Using networks for on-chip communication when designing systems on
chip (SoC), however, raises a number of new issues that must be taken
into account. This is because, in contrast to existing on-chip interconnects
(e.g., buses, switches, or point-to-point wires), where the communicating
modules are directly connected, in a NoC the modules communicate remotely
via network nodes. As a result, interconnect arbitration changes
from centralized to distributed, and issues like out-of-order transactions,
higher latencies, and end-to-end flow control must be handled either by
the intellectual property block (IP) or by the network itself.

Most of these topics have been already the subject of research in the field
of computer networks [24] and parallel machine interconnect networks [6].
However, on-chip networks have different properties (e.g., tighter link
synchronization) and constraints (e.g., higher memory cost) leading to
different design choices, which in the end affect the network services.

In this paper, we compare NoCs and off-chip networks, showing both
their similarities and differences. We also explore the difference between
NoCs and existing on-chip interconnects. We list new issues that must be
resolved in system design due to the multi-hop nature of NoCs, and present
an interface which takes these issues into consideration. Our interface is
aimed at being similar to a split-transaction bus interface, such as VCI [25]
or OCP [17], to allow simple, low-cost wrappers to bus interfaces, and
to allow backward compatibility with existing IPs. Our interface uses a
request-response protocol that provides basic read and write operations.
But our interface extends bus interfaces to fully exploit the power of our
NoC [8, 19, 20]. For example, it offers connection-based communication
where end-to-end flow control and time-related guarantees (e.g., bounded
latency) can be requested.

The paper is organized as follows. In the next two sections we compare
NoC properties with those of off-chip networks and buses, respectively.
In Section IV, we present the services we offer in our network. Finally, we
present our conclusions.



Networks have been the subject of research for decades, both in the
context of local and wide area networks (computer networks) [24], and as
an interconnect for parallel machines [6]. Both are very much related to
on-chip networks, and many of the results in those fields are also applicable
on chip. However, NoC's premises are different from off-chip networks,
and, therefore, most of the network design choices must be reevaluated.

NoCs differ from off-chip networks mainly in their constraints and
synchronization. Typically, most on-chip resources have much tighter
constraints compared to off-chip. Storage (i.e., memory) and computation
resources are relatively more expensive, whereas the number of
point-to-point links is larger on chip than off chip [7].

Storage is expensive, because general-purpose on-chip memory, such as
RAMs, occupies a large area. Having the memory distributed in the network
components in relatively small sizes is even worse, as the overhead area
in the memory then becomes dominant.

Also computation for on-chip networks comes at a relatively high cost
compared to off-chip networks. An off-chip network interface usually
contains a dedicated processor to implement the protocol stack up to network
layer or even higher, to off-load the host processor from the communication
processing. Including a dedicated processor in a network interface is
not feasible on chip, as the size of the network interface will become
comparable to or larger than the IP to be connected to the network. Moreover,
running the protocol stack on the IP itself may also be not feasible,
because often these IPs have one dedicated function only, and do not have
the capabilities to run a network protocol stack.

The number of wires and pins to connect network components is an
order of magnitude larger on chip than off chip [7]. If they are not used
massively for other purposes than NoC, they allow wide point-to-point
interconnects (e.g., 300-bit links) [7, 15]. This is not possible off-chip, where
links are relatively narrower: 8-16 bits.

On-chip wires are also relatively short, allowing a much tighter
synchronization than off chip. This allows a reduction in the buffer space in
the routers because the communication can be done at a smaller granularity.
In the current semiconductor technologies, wires are also fast and
reliable, which allows simpler link-layer protocols (e.g., no need for error
correction, or retransmission). This also compensates for the lack of
memory and computational resources.

In the rest of the section, we list five network issues that have a direct
impact on the NoC cost: reliable communication, deadlock, data ordering,
network flow control and buffering strategy, and time-related guarantees.
For each of them, we discuss the differences and similarities for on- and
off-chip networks.
Reliable communication. A consequence of the tight on-chip resource
constraints is that the network components (i.e., routers and network
interfaces) must be fairly simple to minimize computation and memory
requirements. Luckily, on-chip wires provide a reliable communication
medium, which avoids the considerable overhead incurred by the off-chip
networks for providing reliable communication. Data integrity can be
provided at low cost at the data link layer. However, data loss also depends
on the network architecture, as in most computer networks data is simply
dropped if congestion occurs in the network [6, 24]. On-chip, dropping
data may lead to a too costly implementation of reliable communication.
We show below that a network where no data is dropped can lead to a much
lower-cost solution, at the peril of introducing the possibility of deadlock.

Deadlock. Computer network topologies have generally an irregular
(possibly dynamic) structure and bidirectional links, which can introduce
buffer cycles. In such topologies, packet dropping at the network nodes
may be required to avoid deadlocks.

Deadlock can also be avoided without dropping data, for example, by
introducing constraints either in the topology or routing. Fat-tree
topologies have already been considered for NoCs, where deadlock is avoided by
bouncing back packets in the network in case of overflow [9]. Tile-based
approaches to system design [7, 15, 23] use mesh or torus network
topologies, where deadlock can be avoided using, for example, a turn-model
routing algorithm [6].

An alternative solution for deadlock in NoCs, which takes into consideration
that modules connecting to the network are either masters (initiating
requests and receiving responses), or slaves (receiving requests and sending
back responses), is to maintain separate virtual networks (with separate
buffers) for requests and responses [6].



Data ordering. In a network, data sent from a source to a destination
may arrive out of order due to reordering in network nodes, following
different routes, or retransmission after dropping. For off-chip networks
out-of-order data delivery is typical. However, for NoCs where no data is
dropped, data can be forced to follow the same path between a source and
a destination (deterministic routing) with no reordering. This in-order data
transportation requires less buffer space, and reordering modules are no
longer necessary.



Network flow control and buffering strategy. Network flow control
and buffering strategy have a direct impact on the memory utilization
in the network. Wormhole routing requires only a flit buffer in the
router, whereas store-and-forward and virtual-cut-through routing require
at least the buffer space to accommodate a packet. Consequently, on chip,
wormhole routing may be preferred over virtual-cut-through or
store-and-forward routing. Similarly, input queuing may be a lower memory-cost
alternative to virtual-output-queuing or output-queuing buffering strategies,
because it has fewer queues. Dedicated FIFO memory structures at a low cost
also enable on-chip usage of virtual-cut-through routing or virtual output
queuing for a better performance [19]. However, using virtual-cut-through
routing and virtual output queuing at the same time is still too costly [19].
Figure 2. A bus



Time-related guarantees. Off-chip networks typically use packet
switching and offer best-effort services. Contention can occur at each
network node, making latency guarantees very hard to offer. Throughput
guarantees can still be offered using schemes such as rate-based switching [26]
or deadline-based packet switching [18], but with high buffering costs.

An alternative to provide such time-related guarantees is to use time- 
division multiple access (TDMA) circuits, where every circuit is dedicated 
to a network connection. Circuits provide guarantees at a relatively low 
memory and computation cost. Network resource utilisation is increased 
when the network architecture allows any leftover guaranteed bandwidth 
to be used by best-effort communication [10, 19, 20].
III. From buses to NoCs

Introducing networks (Figure 1) as on-chip interconnects radically
changes the communication when compared to direct interconnects, such
as buses or switches (Figure 2). This is because of the multi-hop nature
of a network, where communication modules are not directly connected,
but separated by one or more network nodes. This is in contrast with the
prevalent existing interconnects (i.e., buses) where modules are directly
connected. The implications of this change reside in the arbitration (which
must change from centralized to distributed), and in the communication
properties (e.g., ordering, or flow control).
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Si this section, we list some of these topics, and outline the differ- 
ences of NoCs and buses. We refer mainly to buses as direct intercon- 
nects, because currently they are the most used on-chip interconnect Most 
of the bus characteristics also hold for other direct interconnects (eg,, 
switches [16*}). Multilevel buses are a hybrid between buses and NoCs. 
Depending on the functionality of the bridges, for our purposes, multilevel 
buses either behave like simple buses [2] or his NoCs. 



Programming Model. The programming model of a bus typically consists of
load and store operations, which are implemented as a sequence of
primitive bus transactions. Bus interfaces typically have dedicated groups
of wires for command, address, write data, and read data [1, 12, 13, 17,
25].

A bus is a resource shared by multiple IPs. Therefore, before using it,
IPs must go through an arbitration phase, where they request access to the
bus, and block until the bus is granted to them.

A bus transaction involves a request and possibly a response. Modules
issuing requests are called masters, and those serving requests are called
slaves. If there is a single arbitration for a request-response pair, the
bus is called non-split. In this case, the bus remains allocated to the
master of the transaction until the response is delivered, even when this
takes a long time. Alternatively, in a split bus, the bus is released
after the request to allow transactions from different masters to be
initiated. However, a new arbitration must be performed for the response
such that the slave can access the bus [11].

For both split and non-split buses, both communication parties have direct
and immediate access to the status of the transaction. In contrast,
network transactions are one-way transfers from an output buffer at the
source to an input buffer at the destination that cause some action at the
destination, the occurrence of which is not visible at the source [6]. The
effects of a network transaction are observable only through additional
transactions. A request-response type of operation is still possible, but
requires at least two distinct network transactions. Thus, a bus-like
transaction in a NoC will essentially be a split transaction.

Transaction Ordering. Traditionally, on a bus all transactions are ordered
(cf. Peripheral VCI [25], AMBA [1], or CoreConnect PLB and OPB [12, 13]).
This is possible at a low cost, because the interconnect, being a direct
link between the communicating parties, does not reorder data. However, on
a split bus, a total ordering of transactions on a single master may still
cause performance penalties when slaves respond at different speeds. To
solve this problem, recent extensions to bus protocols allow transactions
to be performed on connections. Ordering of transactions within a
connection is still preserved, but between connections there are no
ordering constraints (e.g., OCP [17], or Basic VCI [25]). A few of the bus
protocols allow out-of-order responses per connection in their advanced
modes (e.g., Advanced VCI [25]), but both requests and responses arrive at
the destination in the same order as they were sent.

In a NoC, ordering becomes weaker. Global ordering can only be provided at
a very high cost due to the conflict between the distributed nature of the
networks, and the requirement of a centralized arbitration necessary for
global ordering.

Even local ordering, between a source-destination pair, may be costly.
Data may arrive out of order if it is transported over multiple routes. In
such cases, to still achieve in-order delivery, data must be labeled with
sequence numbers and reordered at the destination before being delivered.
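
The reordering step can be illustrated with a small Python sketch. It is
not taken from any particular network implementation: the class name
ReorderBuffer, its method names, and the assumption that sequence numbers
start at 0 and increase by one per packet are all hypothetical.

```python
class ReorderBuffer:
    """Delivers packets in sequence-number order at the destination.

    Assumption (not from the text): sequence numbers start at 0 and
    increase by one for every packet sent by the source.
    """

    def __init__(self):
        self.next_seq = 0   # next sequence number that may be delivered
        self.pending = {}   # early arrivals, keyed by sequence number

    def receive(self, seq, payload):
        """Accept one packet; return the payloads that are now deliverable."""
        self.pending[seq] = payload
        delivered = []
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered


# Packets taking different routes arrive out of order.
buf = ReorderBuffer()
arrivals = [(1, "b"), (0, "a"), (3, "d"), (2, "c")]
in_order = [p for seq, data in arrivals for p in buf.receive(seq, data)]
```

The cost mentioned above shows up as the pending dictionary: the
destination must hold every early packet until the missing ones arrive.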



Atomic Chains of Transactions. An atomic chain of transactions is a
sequence of transactions initiated by a single master that is executed on
a single slave exclusively. That is, other masters are denied access to
that slave once the first transaction in the chain has claimed it. This
mechanism is widely used to implement synchronization mechanisms between
master modules (e.g., semaphores).

On a bus, atomic operations can easily be implemented, as the central
arbiter will either (a) lock the bus for exclusive use by the master
requesting the atomic chain, or (b) know not to grant access to a locked
slave. In the former case, the time resources are locked is shorter,
because once a master has been granted access to a bus, it can quickly
perform all the transactions in the chain (no arbitration delay is
required for the subsequent transactions in the chain). Consequently, the
locked slave and the bus can be opened up again in a short time. This
approach is used in AMBA and CoreConnect. In the latter case, the bus is
not locked, and can still be used by other modules, however, at the price
of a longer locking time of the slave. This approach is used in VCI and
OCP.

In a NoC, where the arbitration is distributed, masters do not know that a
slave is locked. Therefore, transactions to a locked slave may still be
initiated, even though the locked slave cannot accept them. Consequently,
to prevent deadlock, these other transactions must be either dropped, or
stored such that transactions in the atomic chain can be filtered and
still be served. Moreover, the time a module is locked is much longer in
the case of NoCs, because of the higher latency per transaction.

Deadlock. In buses, deadlock is generally not an issue. Deadlock can still
occur at the application level (e.g., an atomic chain of transactions that
locks the bus and is never finished), but it is not caused by the
interconnect itself.

In a network, deadlock becomes a more important issue, and special care
has to be taken in the network design to avoid deadlock. Deadlock is
mainly caused by cycles in the buffers. To avoid deadlock, either network
nodes must drop packets when their buffers are filled, or routing must be
cycle-free. In a NoC, we believe the latter is preferable, because of its
lower cost in achieving reliable communication (see Section II).

A second cause of deadlock is atomic chains of transactions. The reason is
that while a module is locked, the queues storing transactions may get
filled with transactions outside the atomic transaction chain, preventing
the transactions in the chain from reaching the locked module. If atomic
transaction chains must be implemented (to be compatible with processors
allowing this, such as MIPS), the network nodes should be able to filter
the transactions in the atomic chain, or be allowed to drop those blocking
them.

Media Arbitration. An important difference between buses and NoCs is in
the media arbitration scheme. In a bus, master modules request access to
the interconnect, and the arbiter grants access to the whole interconnect
at once. Arbitration is centralized, as there is only one arbiter
component, and global, as all the requests as well as the state of the
interconnect are visible to the arbiter. Moreover, when a grant is given,
the complete path from the source to the destination is exclusively
reserved.

In a non-split bus, arbitration takes place once when a transaction is
initiated. As a result, the bus is granted for both request and response.
In a split bus, requests and responses are arbitrated separately.

In a NoC, arbitration is also necessary, as it is a shared interconnect.
However, in contrast to buses, the arbitration is distributed, because it
is performed in every router, and is based only on local information.
Arbitration of the communication resources (links, buffers) is performed
incrementally as the request or response advances [19].

Destination Name and Routing. For a bus, the command, address, and data
are broadcast on the interconnect. They arrive at every destination, of
which one activates based on the broadcast address, and executes the
requested command. This is possible because all modules are directly
connected to the same bus.

In a NoC, it is not feasible to broadcast information to all destinations,
because it must be copied to all routers and network interfaces. This
floods the network with data. The address is better decoded at the source
to find a route to the destination module. A transaction address will
therefore have two parts: (a) a destination identifier, and (b) an
internal address at the destination.
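
As a rough illustration of source-side decoding, the following Python
sketch splits an address into the two parts named above. The field widths
(8 bits for the destination identifier, 24 bits for the internal address)
and the contents of the ROUTES table are assumptions made for the example,
not values given by the text.

```python
DEST_BITS = 8    # assumed width of the destination identifier
ADDR_BITS = 24   # assumed width of the internal address


def split_address(address):
    """Split a transaction address into (destination id, internal address)."""
    if not 0 <= address < (1 << (DEST_BITS + ADDR_BITS)):
        raise ValueError("address out of range")
    dest_id = address >> ADDR_BITS
    internal = address & ((1 << ADDR_BITS) - 1)
    return dest_id, internal


# Hypothetical routing table held at the source:
# destination id -> output port to take at each hop.
ROUTES = {0x01: [2, 0], 0x02: [2, 1, 3]}


def route_for(address):
    """Decode the address at the source and look up the route."""
    dest_id, internal = split_address(address)
    return ROUTES[dest_id], internal
```

Only the destination identifier is needed inside the network; the internal
address travels as data and is interpreted at the destination.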

Latency. Transaction latency is caused by two factors: (a) the access time
to the bus, which is the time until the bus is granted, and (b) the
latency introduced by the interconnect to transfer the data.

For a bus, where the arbitration is centralized, the access time is
proportional to the number of masters connected to the bus. The transfer
latency itself is typically constant and relatively fast, because the
modules are linked directly. However, the speed of transfer is limited by
the bus speed, which is relatively slow.

In a NoC, arbitration is performed at each router for the following link.
The access time per router is small. Both end-to-end access time and
transport time increase proportionally to the number of hops between
master and slave. However, network links are unidirectional and
point-to-point, and hence can run at higher frequencies than buses, thus
lowering the latency.

From a latency perspective, using a bus or a network is a trade-off
between the number of modules connected to the interconnect (which affects
access time), the speed of the interconnect, and the network diameter.

Data Format. In most modern bus interfaces the data format is defined by
separate wire groups for the transaction type, address, write data, read
data, and return acknowledgments/errors (e.g., VCI, OCP, AMBA, or
CoreConnect). This is used to pipeline transactions. For example,
concurrently with sending the address of a read transaction, the data of a
previous write transaction can be sent, and the data from an even earlier
read transaction can be received. Moreover, having dedicated wire groups
simplifies the transaction decoding: there is no need for a mechanism to
select between different kinds of data sent over a common set of wires.

Inside a network, there is typically no distinction between different
kinds of data. Data is treated uniformly, and passed from one router to
another. This is done to minimize the control overhead and buffering in
routers. If separate wires were used for each of the above-mentioned
groups, separate routing, scheduling and queuing would be needed,
increasing the cost of routers.

In addition, in a network, at each layer in the protocol stack, control
information must be supplied together with the data (e.g., command type,
address, or data size). This control information is organized as an
envelope around the data. That is, first a header is sent, followed by the
actual data (payload), followed possibly by a trailer. Multiple such
envelopes may be provided for the same data, each carrying the
corresponding control information for each layer in the network protocol
stack [6, 24].

Buffering and Flow Control. Buffering data of a master (output buffering)
is used both for buses and NoCs to decouple computation from
communication. However, for NoCs output buffering is also needed to
marshal data, which consists of (a) (optionally) splitting the outgoing
data into smaller packets which are transported by the network, and (b)
adding control information for the network around the data (packet
header). To avoid output buffer overflow, the master must not initiate
transactions that generate more data than the currently available space.
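
The two marshalling steps and the overflow check can be sketched as
follows. This is only an illustration: the function name marshal, the
two-byte header layout (destination identifier, payload length), and the
use of bytes as the unit of buffer space are assumptions, not the packet
format of any actual network.

```python
def marshal(data, dest_id, max_payload, free_space):
    """Split data into packets and prepend a header to each one.

    Assumed header layout (hypothetical): one byte destination id
    followed by one byte payload length. Returns None when the packets
    would not fit in the free output-buffer space, in which case the
    master must not start the transaction yet.
    """
    if not (0 <= dest_id <= 255 and 0 < max_payload <= 255):
        raise ValueError("dest_id and max_payload must fit in one byte")
    # (a) split the outgoing data into smaller chunks
    chunks = [data[i:i + max_payload] for i in range(0, len(data), max_payload)]
    # (b) add control information (the packet header) around each chunk
    packets = [bytes([dest_id, len(chunk)]) + chunk for chunk in chunks]
    if sum(len(p) for p in packets) > free_space:
        return None
    return packets


packets = marshal(b"abcdefgh", dest_id=3, max_payload=3, free_space=64)
```

Note that the space check must count the headers as well as the payload,
since both occupy the output buffer.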

Similarly to output buffering, input buffering is also used to decouple
computation from communication. In a NoC, input buffering is also required
to unmarshal data.

In addition, flow control for input buffers differs for buses and NoCs.
For buses, the source and destination are directly linked, and the
destination can therefore signal directly to a source that it cannot
accept data. This information can even be available to the arbiter, such
that the bus is not granted to a transaction trying to write to a full
buffer.

In a NoC, however, the destination of a transaction cannot signal directly
to a source that its input buffer is full. Consequently, transactions to a
destination can be started, possibly from multiple sources, after the
destination's input buffer has filled up. Two policies can be adopted when
an input buffer is full. The first is not to accept additional incoming
transactions, and to store them in the network. However, this approach can
easily lead to network congestion, as the data could eventually be stored
all the way to the sources, blocking the links in between. The second
approach is to accept incoming transactions at a full destination, and
drop some data in the input buffer. Congestion is avoided but data is
lost, which is undesirable.

To avoid input buffer overflow, connections can be used, together with
end-to-end flow control. At connection set up between a master and one or
more slaves, buffer space is allocated at the network interfaces of the
slaves, and the network interface of the master is assigned credits
reflecting the amount of buffer space at the slaves. The master can only
send data when it has enough credits for the destination slave(s). The
slaves grant credits to the master when they consume data.



IV. The Æthereal Approach

As described in the previous two sections, NoCs have different properties
from both existing off-chip networks and existing on-chip interconnects.
As a result, existing protocols and service interfaces cannot be adopted
directly in NoCs, but must take the characteristics of NoCs into account.
For example, a protocol such as TCP/IP assumes the network is lossy, and
includes significant complexity to provide reliable communication.
Therefore, it is not suitable in a NoC where we assume data transfer
reliability is already solved at a lower level. On the other hand,
existing on-chip protocols such as VCI, OCP, AMBA, or CoreConnect are also
not directly applicable. For example, they assume ordered transport of
data: if two requests are initiated from the same master, they will arrive
in the same order at the destination. This does not hold automatically for
NoCs. Atomic chains of transactions and end-to-end flow control also need
special attention in a NoC interface.

Our objectives when defining our network services are the following.
First, the services abstract from the network internals as much as
possible. This is a key ingredient in tackling the challenge of decoupling
the computation from communication [14, 22], which allows IPs (the
computation part), and the interconnect (the communication part) to be
designed independently from each other. As a consequence, our services are
positioned at the transport layer in the ISO-OSI reference model [24],
which is the first layer to be independent of the implementation of the
network.

Second, we aim at a NoC interface as close as possible to a bus interface.
NoCs can then be introduced non-disruptively: with minor changes, existing
IPs, methodologies and tools can continue to be used. As a consequence, we
use a request-response interface, similar to interfaces for split buses
[1, 12, 13, 17, 25].

Third, our interface extends traditional bus interfaces to fully exploit 
the power of NoCs. For example, we offer connection-based communica- 
tion which does not only relax ordering constraints (as for buses), but also 
enables new communication properties, such as end-to-end flow control 
based on credits, or guaranteed throughput [8, 19, 20]. All these properties 
can be set for each connection individually. 



A. The Æthereal Connection and Transaction Model

IPs interact with our network [8, 19, 20] at so-called network interfaces
(NIs). NIs provide NI ports (NIPs) through which the communication
services are accessed. As shown in Figure 3, an NI can have several NIPs
to which one or more IPs (computation elements or memories, but not
interconnection elements) can be connected. Similarly, an IP can be
connected to more than one NI and NIP.

Figure 3. Examples of links between NIs and IPs.

Communication between NIPs is performed on connections. Connections are
introduced to describe and identify communication with different
properties, such as guaranteed throughput, bounded latency and jitter,
ordered delivery, or flow control. For example, to distinguish and
independently guarantee communication of 1 Mb/s and 25 Mb/s, two
connections can be used. Two NIPs can be connected by multiple
connections, possibly with different properties. Connections as defined
here are similar to the concept of threads and connections from OCP and
VCI. Where in OCP and VCI connections are used only to relax transaction
ordering, we generalise from only the ordering property to include
configuration of buffering and flow control, guaranteed throughput, and
bounded latency per connection.

Æthereal connections must be created with the desired properties before
being used. This may result in resource reservations inside the network
(e.g., buffer space, or percentage of the link usage per time unit). If
the requested resources are not available, the network will refuse the
request. After usage, connections are closed, which leads to freeing the
resources occupied by that connection.
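
The open, refuse, and close behaviour described above can be sketched in
Python as a simple resource ledger. The class Network, its method names,
and the two resource kinds (buffer_words and slots) are hypothetical and
chosen only for illustration; they are not the configuration interface of
the network itself.

```python
class Network:
    """Tracks reservable resources; names and units are illustrative."""

    def __init__(self, buffer_words, slots):
        self.free = {"buffer_words": buffer_words, "slots": slots}
        self.connections = {}
        self.next_id = 0

    def open(self, **request):
        """Reserve resources for a connection, or refuse by returning None."""
        if any(amount > self.free[kind] for kind, amount in request.items()):
            return None  # request refused; nothing is reserved
        for kind, amount in request.items():
            self.free[kind] -= amount
        conn_id = self.next_id
        self.next_id += 1
        self.connections[conn_id] = request
        return conn_id

    def close(self, conn_id):
        """Close a connection and free the resources it occupied."""
        for kind, amount in self.connections.pop(conn_id).items():
            self.free[kind] += amount
```

A request is either granted in full or refused in full, so a refused
request leaves the reservations of existing connections untouched.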

To allow more flexibility in configuring connections, and, hence, better
resource allocation per connection, the outgoing and return parts of
connections are configured separately. For example, different buffer space
can be allocated in the ANIP and PNIPs, respectively, or different
bandwidths can be reserved for requests and responses.

Depending on the requested services, the time to handle a connection
(i.e., creating, closing, modifying services) can be short (e.g.,
creating/closing an unordered, lossy, best-effort connection) or
significant (e.g., creating/closing a multicast guaranteed-throughput
connection). Consequently, connections are assumed to be created, closed,
or modified infrequently, coinciding e.g. with reconfiguration points,
when the application requirements change.

Communication takes place on connections using transactions, consisting of
a request and possibly a response. The request encodes an operation (e.g.,
read, write, flush, test-and-set, nop) and possibly carries outgoing data
(e.g., for write commands). The response returns data as a result of a
command (e.g., read) and/or an acknowledgment.

Figure 4. Transaction composition.

A connection involves at least two NIPs. Transactions on a connection are
always started at one and only one of the NIPs, called the connection's
active NIP (ANIP). All the other NIPs of the connection are called passive
NIPs (PNIPs).

There can be multiple transactions active on a connection at a time (as
for split buses). That is, transactions can be started at the ANIP of a
connection while responses for earlier transactions are pending. If a
connection has multiple slaves, multiple transactions can be initiated
towards different slaves. Transactions are also pipelined between a single
pair of a master and a slave for both requests and responses. In
principle, transactions can also be pipelined within a slave, if the slave
allows this.

A transaction is composed of the following messages (see Figure 4):

• A command message (CMD) is sent by the ANIP, and describes the
action to be executed at the slave connected to the PNIP. Examples
of commands are read, write, test-and-set, and flush. Commands
are the only messages that are compulsory in a transaction. For
NIPs that allow only a single command with no parameters (e.g.,
fixed-size address-less write), we assume the command message
still exists, even if it is implicit (i.e., not explicitly sent by the IP).

• An out data message (OUTDATA) is sent by the ANIP following a
command that requires data to be executed (e.g., write, multicast,
and test-and-set).

• A return data message (RETDATA) is sent by a PNIP as a
consequence of a transaction execution that produces data (e.g., read,
and test-and-set).

• A completion acknowledgment message (RETSTAT) is an optional
message which is returned by a PNIP when a command has been
completed. It may signal either a successful completion or an
error. For transactions including both RETDATA and RETSTAT, the
two messages can be combined in a single message for efficiency.
However, conceptually, they both exist: RETSTAT to signal the
presence of data or an error, and RETDATA to carry the data. In
bus-based interfaces RETDATA and RETSTAT typically exist as two
separate signals [1, 12, 13, 17, 25].

Figure 5. Transaction examples.

Messages composing a transaction are divided into outgoing messages,
namely CMD and OUTDATA, and response messages, namely RETDATA and
RETSTAT. Within a transaction, CMD precedes all other messages, and
RETDATA precedes RETSTAT if present. These rules apply both between
master and ANIP, and between PNIP and slave. Examples of transactions are
shown in Figure 5.
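
The ordering rules just stated are simple enough to check mechanically.
The sketch below is illustrative only: the function name is hypothetical,
messages are represented by their names as strings, and it additionally
assumes that each message type occurs at most once per transaction, which
the text does not state.

```python
ALLOWED = {"CMD", "OUTDATA", "RETDATA", "RETSTAT"}


def is_valid_transaction(messages):
    """Check the message-ordering rules for a single transaction.

    Assumption (ours): each message type appears at most once.
    """
    if not messages or messages[0] != "CMD":
        return False  # CMD is compulsory and precedes all other messages
    if len(set(messages)) != len(messages) or not set(messages) <= ALLOWED:
        return False
    if "RETDATA" in messages and "RETSTAT" in messages:
        # RETDATA precedes RETSTAT when both are present
        return messages.index("RETDATA") < messages.index("RETSTAT")
    return True
```

For example, an acknowledged write is the sequence CMD, OUTDATA, RETSTAT,
and an acknowledged read is CMD, RETDATA, RETSTAT.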

We classify connections as follows (see Figure 6):

• A simple connection is a connection between one ANIP and one
PNIP.

• A narrowcast connection is a connection between one ANIP and
one or more PNIPs, in which the ANIP initiates transactions that
are executed by exactly one PNIP. An example of the narrowcast
connection is shown in Figure 7, where the ANIP performs
transactions on an address space which is mapped on two memory
modules. Depending on the transaction address, a transaction
is executed on only one of these two memories.

• A multicast connection is a connection between one ANIP and
one or more PNIPs, in which the sent messages are duplicated and
each PNIP receives a copy of those messages. In a multicast
connection no return messages are currently allowed, because of the
large traffic they generate (i.e., one response per destination). It
could also increase the complexity in the ANIP, because individual
responses from PNIPs must be merged into a single response for
the ANIP. This requires buffer space and/or additional computation
for the merging itself.

Figure 6. Connection types.

Figure 7. A narrowcast connection.



In this section we describe the properties that can be configured for a
connection: guaranteed message integrity, guaranteed transaction
completion, various transaction orderings, guaranteed throughput, bounded
latency and jitter, and connection flow control.



Data Integrity. Data integrity means that the payload of the message is
not changed (accidentally or not) during transport. We assume that data
integrity is already solved at a lower layer in our network, namely at the
link layer, because in current on-chip technologies data can be
transported uncorrupted over links. Consequently, our network interface
always guarantees that messages are delivered uncorrupted at the
destination.

Transaction Completion. A transaction without a response is said to be
complete when it has been executed by the slave. As there is no response
message to the master, no guarantee regarding transaction completion can
be given.

Figure 8. Message ordering is observable at a, b, c, and d.

A transaction with a response is said to be complete when a RETSTAT
message is received from the ANIP. The transaction may either (a) be
executed successfully, in which case a success RETSTAT is returned, (b)
fail in its execution at the slave, in which case an execution error
RETSTAT is returned, or (c) fail because of buffer overflow in a
connection with no flow control, in which case an overflow error is
reported.

In our network, routers do not drop data; therefore, messages are always
guaranteed to be delivered at the NI. For connections with flow control,
NIs do not drop data either. Thus, message delivery to the IPs is
guaranteed automatically in this case.

However, if there is no flow control, messages may be dropped at the
network interface in case of buffer overflow (see the paragraph on
end-to-end flow control below). All of CMD, OUTDATA, and RETDATA may be
dropped at the NI. To guarantee transaction completion, RETSTAT is not
allowed to be dropped. Consequently, in the ANIPs enough buffer space must
be provided to accommodate RETSTAT messages for all outstanding
transactions. This is enforced by bounding the number of outstanding
transactions.



Transaction Ordering. In this section, we describe the ordering
requirements between different transactions within a single connection.
Over different connections no ordering of transactions is defined at the
transport layer.

There are several points in a connection where the order of transactions
can be observed (see Figure 8): (a) the order in which the master presents
CMD messages to the ANIP, (b) the order in which the CMDs are delivered to
the slave by the PNIP, (c) the order in which the slave presents the
responses to the PNIP, and (d) the order in which the responses are
delivered to the master by the ANIP. Note that not all of (b), (c), and
(d) are always present. Moreover, there are no assumptions about the order
in which the slaves execute transactions; we can only observe the order of
the responses. We consider the order of the transaction execution to be a
system decision, and not a part of the interconnect protocol.

At both ANIP and PNIPs, outgoing messages belonging to different
transactions on the same connection are allowed to be interleaved. For
example, two write commands can be issued, and only afterwards their data
follow. If the order of OUTDATA messages differs from the order of CMD
messages, transaction identifiers must be introduced to associate OUTDATAs
with their corresponding CMD.

Outgoing messages can be delivered by the PNIPs to the slaves (see
Figure 8-b) as follows:

• Unordered, which imposes no order on the delivery of the outgoing
messages of different transactions at the PNIPs.

• Ordered locally, where transactions must be delivered to each
PNIP in the order they were sent, but no order is imposed across
PNIPs. Locally-ordered delivery of the outgoing messages can be
provided either by an ordered data transportation, or by reordering
of outgoing messages at the PNIP.

• Ordered globally, where transactions must be delivered in the
order they were sent, across all PNIPs of the connection.
Globally-ordered delivery of the outgoing part of transactions requires a
costly synchronization mechanism.

Transaction response messages can be delivered by the slaves to the
PNIPs (see Figure 8-c) as follows:

• Ordered, when RETDATA and RETSTAT messages are returned in
the same order as the CMDs were delivered to the slave.

• Unordered, otherwise.

When responses are unordered, there has to be a mechanism to identify
the transaction to which a response belongs. This is usually done using
tags attached to messages for transaction identification (similar to tags in

Response messages can be delivered by the ANIP to the master (see
Figure 8-d) as follows:

• Unordered, which imposes no order on the delivery of responses.
Here, also, tags must be used to associate responses with their
corresponding CMDs.

• Ordered locally, where RETDATA and RETSTAT messages of
transactions for a single slave are delivered in the order the original
CMDs were presented by the master to the ANIP. Note that there is
no ordering imposed for transactions to different slaves within the
same connection.

• Globally ordered, where all responses in a connection are
delivered to the master in the same order as the original CMDs. When
transactions are pipelined on a connection, then globally-ordered
delivery of responses requires reordering at the ANIP.

All 3 x 2 x 3 = 18 combinations of the above orderings are possible. Out
of these, we define and offer the following two. An unordered connection
is a connection in which no ordering is assumed in any part of the
transactions. As a result, the responses must be tagged to be able to
identify to which transaction they belong. Implementing unordered
connections has low cost; however, they may be harder to use, and
introduce the overhead of tagging.

An ordered connection is defined as a connection with local ordering for
the outgoing messages from PNIPs to slaves (Figure 8-b), ordered responses
at the PNIPs (Figure 8-c), and global ordering for responses at the ANIP
(Figure 8-d). We choose local ordering for the outgoing part because
global ordering has too high a cost, and has few uses. The ordering of
responses is selected to allow a simple programming model with no tagging.
Global ordering at the ANIP is possible at a moderate cost, because all
the ordering is done locally in the ANIP.

A user can emulate connections with global ordering at the PNIPs using
non-pipelined acknowledged transactions.

Connection latency, throughput, and jitter. In our network, throughput can
be reserved for connections in a time-division multiple access (TDMA)
fashion, where bandwidth is split in fixed-size slots on a fixed time
frame. Bandwidth, as well as bounds on latency and jitter, can be
guaranteed when slots are reserved. They are all defined in multiples of
the slots.
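
To make the slot arithmetic concrete, the following Python sketch computes
two quantities from a set of reserved slots. It rests on simplifying
assumptions of ours, not on the network's actual allocation rules: the
guaranteed bandwidth is taken to be the link bandwidth times the fraction
of reserved slots, and the waiting time for a slot is bounded by the
largest distance between consecutive reserved slots in the frame. The
function names are hypothetical.

```python
def guaranteed_bandwidth(reserved, frame_slots, link_bandwidth):
    """Bandwidth share of a connection that holds the given slots."""
    return link_bandwidth * len(reserved) / frame_slots


def worst_case_wait(reserved, frame_slots):
    """Largest distance, in slots, between consecutive reserved slots.

    The frame repeats, so the distance from the last reserved slot
    wraps around to the first one. `reserved` must not be empty.
    """
    slots = sorted(reserved)
    gaps = [(b - a) % frame_slots or frame_slots
            for a, b in zip(slots, slots[1:] + slots[:1])]
    return max(gaps)
```

Under these assumptions, two slots out of a frame of eight give a quarter
of the link bandwidth in both cases below, but spreading them evenly
(slots 0 and 4) bounds the wait at 4 slots, whereas adjacent slots (0 and
1) bound it at 7.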

Guaranteed-throughput connections can overbook resources in some cases.
For example, when an ANIP opens a guaranteed-throughput read connection,
it must reserve slots for the read command messages, and for the read data
messages. The ratio between the two can be very large (e.g., 1:100), which
leads either to a large number of slots, or to bandwidth being wasted for
the read command messages.

To solve this problem, we allow the request and response parts of a
connection to be configured independently for all of throughput, latency
and jitter. Consequently, the request part of a connection can be best
effort while the response can have guaranteed throughput (or vice versa).
For the example mentioned above, we can use best-effort read messages, and
guaranteed-throughput read-data messages. No global connection guarantees
can be offered in this case, but the overall throughput can be higher and
more stable than in the case of using only best-effort traffic.

Connection flow control. As mentioned earlier, our network guarantees that
messages are delivered to the NI. Messages sent from one of the NIPs are
not immediately visible at the other NIP, because of the multi-hop nature
of networks. Consequently, handshakes over a network would allow only a
single message to be transmitted at a time. This limits the throughput on
a connection and adds latency to transactions. To solve this problem, and
achieve a better network utilization, the messages must be pipelined. In
this case, if the data is not consumed at the PNIP at the same rate it
arrives, either flow control must be introduced to slow down the producer,
or data may be lost because of limited buffer space at the consumer NI.

We introduce end-to-end flow control at the level of connections, which 
requires buffer space to be associated with connections. End-to-end flow
control ensures that messages are sent over the network only when there is 
enough space in the nip's destination buffer to accommodate them. 

End-to-end flow control is optional (i.e., to be requested when the
connection is opened) and can be configured independently for the outgoing
and return paths. When no flow control is provided, messages are dropped when
buffers overflow. Multiple policies of dropping messages are possible, as
in off-chip networks. Possible scenarios include: (a) the oldest message is
dropped (milk policy), or (b) the newest message is dropped (wine policy)
[24].

We opt for a credit-based flow control. Credits are associated with the
empty buffer space at the receiver NI. The sender's credit is lowered as data
is sent. When data is delivered at the receiver nip, credits are granted to
the sender. If the sender's credit is not sufficient to send some data, the NI
at the sender stalls the sending.
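
The credit mechanism can be sketched as follows. This is a minimal model
under stated assumptions, not the NI implementation: the class names
`Sender` and `Receiver`, the one-credit-per-word granularity, and the
choice to grant a credit when a word leaves the receiver buffer are all
hypothetical. It shows the two properties the text relies on: the sender
stalls when its credit is exhausted, and the receiver buffer never
overflows.

```python
from collections import deque


class Receiver:
    """Receiver-side buffer of a connection (hypothetical model)."""

    def __init__(self, buffer_words):
        self.capacity = buffer_words
        self.buffer = deque()

    def deliver(self, word):
        # With correct credit accounting this check can never fail.
        if len(self.buffer) >= self.capacity:
            raise OverflowError("receiver buffer overflow")
        self.buffer.append(word)

    def consume(self):
        """Hand one word to the IP; returns (word, credits to grant)."""
        word = self.buffer.popleft()
        return word, 1


class Sender:
    """Sender side: holds credits equal to the known free buffer space."""

    def __init__(self, initial_credits):
        self.credits = initial_credits

    def try_send(self, word, receiver):
        if self.credits < 1:
            return False  # not enough credit: the sender stalls
        self.credits -= 1  # credit is lowered as data is sent
        receiver.deliver(word)
        return True

    def grant(self, credits):
        self.credits += credits  # credits returned by the receiver
```

A sender created with as many initial credits as the receiver has buffer
words can pipeline that many words without waiting for any handshake;
further sends return False until `grant` is called.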



C. Use Cases

To illustrate the need for differentiated services on connections, we
show in this section some examples of traffic. We describe the properties
they would use over an Æthereal connection to meet their traffic
requirements.

Video processing streams typically require a lossless, in-order video
stream with guaranteed throughput, but possibly allow corrupted samples.
An Æthereal connection for such a stream would require the necessary
throughput, ordered transactions, and flow control. If the video stream is
produced by the master, only write transactions are necessary. In such a
case, with a flow-controlled connection there is no need to also require
transaction completion, because messages are never dropped, and the write
command and its data are always delivered at the destination. Data
integrity is always provided by our network, even though it may not be
necessary in this case.

Another example is that of cache updates, which require uncorrupted,
lossless, low-latency data transfer, but for which ordering and guaranteed
throughput are less important. In such a case, a connection would not
require any time-related guarantees, because a low latency, even if
preferable, is not critical. Low latency can be obtained even with a
best-effort connection. The connection would also require flow control and
guaranteed transaction completion to ensure lossless transactions. However,
no ordering is necessary, because this is not important for cache updates,
and allowing out-of-order transactions can reduce the response time.



V. Conclusions 

In this paper, we compare networks on chip (NoC) to off-chip networks
(e.g., computer networks) and existing on-chip interconnects (e.g., busses).
We show that NoCs have many similarities with off-chip networks. However,
they also differ, especially in their resource constraints. For example,
on a chip, memory and computation resources are more expensive, while
there are more wires. This makes NoC architectures different from off-chip
networks, and requires rethinking of network services.

We also compare NoCs to existing on-chip interconnects, such as buses
and switches. By directly connecting IP blocks, existing on-chip
interconnects can offer tight coupling between masters and slaves, and
global arbitration. In NoCs, masters and slaves are completely decoupled,
and the arbitration is distributed over the network nodes. This makes it
harder to provide guarantees, such as bandwidth lower bounds, and
transaction orderings.

We define a set of NoC services that abstract from the network details.
Using these services in the IP design decouples computation and
communication. We use a request-response transaction model to be close to
existing on-chip interconnect protocols. This eases the migration of current
IPs to NoCs. To fully utilize the NoC capabilities, such as high bandwidth
and transaction concurrency, our services provide connection-oriented
communication. Connections can be configured independently with different
properties. These properties include transaction completion, various
transaction orderings, bandwidth lower bounds, latency and jitter upper
bounds, and flow control.

Our services are a prerequisite for service-based system design, which
makes applications independent of NoC implementations, makes designs
more robust, and enables architecture-independent quality-of-service
strategies.

Abstract

Continuing VLSI technology scaling raises several deep
submicron (DSM) problems like relatively slow interconnect,
power dissipation and distribution, and signal integrity.
Those problems are encountered particularly on
long wires for global interconnect. As clock frequencies
increase, scaled wires become relatively slower, and on-chip
communication will be the limiting performance factor of
future chips. We explain why efficient sharing of the wires
for long-distance communication is the solution to this
problem. We introduce networks on silicon (NoS), that route
packets over shared (semi-)global wires. NoS performance
is expected to be high, but comes at a cost. Balancing the
performance and cost of a NoS is a major challenge, and
we believe busses still have a role to play.



1 Technology trend 

VLSI technology scaling has long followed Moore's law.
No fundamental barriers have been identified that invalidate
this law for at least another decade [12]. Moore's law
predicts that chips in 2010 will count over 4 billion
transistors, operating in the multi-GHz range. This abundance of
transistors will make very complex systems on silicon (SoS)
possible.

However, challenges at all abstraction levels of design
will have to be addressed before such SoSs will become a
reality. The three most important deep submicron (DSM)
challenges, related to all abstraction levels, are: substantial
wire delay, controlling power delivery and dissipation, and
assuring signal integrity.

Until recently, on-chip wiring was cheap. Consequently,
architectural models have been employed that relied on
low-latency communication to globally share expensive
computational resources. Global wire delay stays at best constant
under technology scaling and hence these wires become
effectively slower compared to a gate delay. For example,
for 130 nm technology the reachable distance of a repeated
global signal in a clock cycle is no more than the length of a
chip [4]. For 50 nm technology, crossing a chip with highly
optimized interconnect takes between six and ten clock
cycles, clearly invalidating the low-latency assumption of
today. Hence we must move to system-level architectures
that scale with technology.

Figure 1. The number of 50k blocks for future
process technologies.

A feasible template for a future-proof architecture is
constructed from processing nodes that do not grow in
complexity with technology. Instead, as technology scales, the
number of processing nodes on the chip grows. An
on-chip communication network then combines these nodes
into a SoS [4].

Various publications show that the spanning wires in
blocks of 50k gates scale with technology [4, 13]. This
means that the aforementioned DSM issues can be handled
by CAD tools, assuming their evolutionary improvement.
Figure 1 shows the exponentially increasing amount of such
50k blocks for a large die in subsequent technologies; in
35 nm this number is approximately ten thousand (adapted
from [13] and [4]). It remains to find a communication
architecture that allows a SoS composed of these blocks to
cooperate efficiently.

2 Networks on silicon are inevitable 

Given the growing demand for and impact of interconnect
on system cost and performance, it is worthwhile to
optimize the utilization of wires. Ad-hoc global wiring
structures often lead to a huge number of wires with an
average usage as low as 10% in time [2]. To control cost in
this scenario, the wire packing density must be very high,
which is not beneficial for the power and delay
characteristics. Efficient mechanisms for sharing (semi-)global wires
must solve this cost-performance dilemma.

In deep submicron technologies, (semi-)global wires
need special attention for power, signal-integrity, and
performance reasons. In the discussion below we show how
special circuit techniques can handle these issues. Such
techniques only work, however, when embedded in
dedicated communication IP, which provides a more abstract
interface.

Power is an issue for global interconnect because it costs
more energy to send a bit of information over longer
wires. Reducing the communication delay increases the energy
consumption further, due to bigger drivers. Employing
low-swing signaling for the global wires saves up to a factor four
in power for those wires [15]. Implementing low-swing
signaling requires special circuit techniques.

Signal integrity is hampered increasingly by growing
capacitive and inductive coupling between wires. Capacitive
noise coupling is the result of the large aspect ratio of wires
in DSM technologies. Inductive noise coupling becomes
more of a problem due to the decreasing transition times. IR
drop (see footnote 1) in the supply distribution increasingly
contributes to the noise. The most effective way to make a
connection robust against noise is application of differential
signaling [7]. Differential signaling improves both the
generation of and sensitivity to noise.

The signal propagation delay of an uninterrupted wire
grows quadratically with its length; hence from a certain
length onwards it is advantageous to partition the wire in
segments with repeaters in between. The repeater insertion
technique improves bandwidth and latency but at the cost of
higher power consumption. Wire delay can also be reduced by
fat wires with a lower resistance per unit length, at the cost
of lower wire density. Such wires behave like lossy
transmission lines and require drivers with a resistance matched
to the transmission line.
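
The benefit of repeater insertion can be illustrated with a first-order
sketch. The model below is an assumption made for illustration and does
not come from the cited work: an unrepeated segment is given a delay of
`rc` times its length squared, each repeater adds a fixed
`repeater_delay`, and all quantities are unit-free. The helper names
`wire_delay` and `best_segments` are hypothetical.

```python
def wire_delay(length, segments, rc, repeater_delay):
    """First-order delay of a wire split into equal repeated segments.

    Each segment contributes rc * (length / segments) ** 2, which is the
    quadratic growth mentioned in the text; each of the segments - 1
    repeaters adds a fixed delay (assumed model, arbitrary units).
    """
    if segments < 1:
        raise ValueError("a wire has at least one segment")
    segment_length = length / segments
    wire_part = segments * rc * segment_length ** 2
    repeater_part = (segments - 1) * repeater_delay
    return wire_part + repeater_part


def best_segments(length, rc, repeater_delay, max_segments):
    """Number of segments, up to max_segments, with the lowest delay."""
    return min(
        range(1, max_segments + 1),
        key=lambda n: wire_delay(length, n, rc, repeater_delay),
    )
```

In this model the total delay is `rc * length**2 / n` plus the repeater
overhead, so splitting a long wire first reduces the delay and then, once
the repeater overhead dominates, increases it again. This is why repeaters
pay off only from a certain wire length onwards.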

As a result, we believe that all inter-block communication
will be implemented by hard-macro transmitters and
receivers, employing low-swing differential signaling, with
well-controlled interconnect instead of ad-hoc drivers
handled by standard place-and-route tools. In this way,
communication links can be realised with predictable performance
and DSM robustness.

Currently, the prevalent on-chip interconnects are
busses [1]. In a bus architecture, devices share a single
transmission medium to communicate. At a given time,
only one device has access to the shared medium. An
arbitration mechanism is required to order simultaneous
accesses. Such functionality is typically performed by a
centralized bus arbiter. The performance of a shared-medium
bus scales badly. For an increasing number of bus clients
(i) individual clients get less bandwidth on average, and (ii)
increased capacitive loads and wire length decrease the total
bandwidth.

1 Supply voltage drops are caused by high currents (I) flowing through
the resistance (R) of the supply network. Since the supply voltage reduces
under scaling, IR drop worsens.

Figure 2. Structural view of a network on silicon
consisting of processing nodes (P) and
nodes supporting communication (R, B).

A solution that pairs scalable communication performance
and minimal interconnect cost is expected from networks
on silicon (NoS), where the SoS is considered as a
network of components [2, 3, 1]. Figure 2 illustrates the
hardware architecture of this concept. The outer components
(marked P) exclusively perform processing and storage
functions, whereas the inner components (marked B and
R) form the NoS and cater to communication needs of the
outer components. The basic building blocks of a NoS are
routers (R).

A router forwards data from its input ports to its output
ports in a concurrent fashion. To that end, a router of
arity N contains an N x N switch matrix. Data packets
make their way through the network based on the routing
information in their headers. A link between two routers is
implemented by a point-to-point connection. The links
typically span medium to long distances, ranging from several
to more than twenty millimeters. The actual length
depends on the chosen topology of the network. For a mesh
topology the links are relatively short; for a torus, which is
a mesh with wrap-around connections, some links have a
length of half the edge of the chip. Links can be optimized
for bandwidth, latency, power, or a combination of these,
depending on performance requirements.

3 NoS requirements

An important characteristic of a future system-level
architecture is the separation between computation and
communication. A NoS allows the computational blocks to
communicate with one another via a uniform interface. A
uniform interface is advantageous because (i) it frees the
core developer from having to make assumptions about the
system in which the core will be used, and (ii) does not
constrain the development of newer communication
architectures by detailed interfacing requirements of particular
legacy SoC components [6]. Several on-chip bus standards
are evolving to realize this goal, most notably VCI, put
forward by VSIA [14], and more recently, the Open Core
Protocol.

The fundamental aim of a NoS is to provide flexible and
efficient communication between the thousands of IP blocks
in a system, with performance guarantees. In a typical SoS,
the communication demands of different IP blocks show
large variations. For example, data rates may be constant
(e.g. digital video) or variable (e.g. compressed video). The
importance of latency and jitter also varies greatly. Finally,
the data granularity may range from single words to large
blocks. A NoS should be able to offer different services to
different clients. Each service class must be implemented
efficiently, using a shared uniform infrastructure.

A high utilization of the network comes at a price. When
the network starts to saturate, throughput and latency will
show huge variations, which is not acceptable in real-time
applications. Hence, the network should also provide
guarantees, like lossless data transport, minimal bandwidth,
and bounded latency. The way packets are buffered and
scheduled in routers, and the effects on performance
guarantees, has been the subject of intense research.
Fundamentally, sharing and guarantees are conflicting, and
efficiently combining guaranteed traffic with best-effort traffic
is hard [11]. Although best-effort services are cheaper than
guaranteed services, we believe that the latter are essential
because they enable compositional and scalable integration
of the IP blocks [5]. It is up to the IP integrator at design
time, and up to the application at run time, to make a
trade-off.



4 Performance and cost analysis of NoSs 



The vision of previous sections is that the design of future
SoSs will allow IP blocks to be plugged in at will to
minimize communication costs, but without today's problems
like timing closure. In this section we investigate the
cost implications of system design based on a NoS. We hope
that the overall cost of a NoS, including the full protocol
stack to use it, turns out to be acceptable, such that the
integration blessings of NoSs do not change into a cost
nightmare.

4.1 Performance

The aggregate bandwidth of a router is the product of the
bandwidth per port, BW_port, the arity of the router (number
of ports), N, and a utilization factor, α < 1, corresponding
to the router arbitration scheme.

$BW_{router} = \alpha \, N \, BW_{port}$ (1)
We discuss each in turn. The bandwidth per port is determined
by the wires of the link and by the data path of the router.
In short,

$BW_{port} \le B \cdot \min(BW_{wire}, BW_{switch})$ (2)

where B is the width of the data path. The combined bandwidth
of the B wires of a link is a function of the layout
characteristics (e.g. total length), chosen signaling technique,
and the budgets for power, delay, and area. A first-order
expression for the bandwidth of a repeated global wire
optimised for power-delay is

[...]

where FO4 is the delay of an inverter driving four equally
sized inverters [4]. In a 100 nm technology, this yields 5
Gb/s per wire under worst-case environmental conditions.
Notice that the bandwidth of repeated global wires scales
[...] at the segments.

[...] An aggressive but realistic frequency is 1.25 GHz,
corresponding to the clock frequency of 50k-gate blocks [4]. The critical
function in the data path is the N x N switch. For N up
to 20 it meets the 1.25 GHz data rate, using N 1-out-of-N
multiplexers. This relaxed demand on the wires of the link
can be used to reduce power dissipation and area.

The utilisation factor, α, reflects the effectiveness of
the router to resolve contention on the links. The queuing
strategy, the queue sizes, and the schedule algorithm all
strongly influence α. Accordingly, many queuing policies
and scheduling algorithms have been presented in the
literature. For example, α = 0.59 for infinite FIFO input queues
with uniform and independent traffic. (Virtual) output queuing
gives α ≈ 1 under the same conditions, but at the
cost of larger queues and a more complex scheduling
algorithm [8]. Static scheduling techniques like
(time-division-multiplexed) circuit switching can also improve the
utilisation factor.

Hence, in 100 nm technology, the bandwidth of a 32 bit
router port is approximately 5 GByte/sec.
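
The numbers in this subsection can be checked with a short calculation.
The sketch below is illustrative only: the function names are
hypothetical, and it assumes that the port runs at the 1.25 GHz router
data rate rather than at the 5 Gb/s that a single wire could sustain. It
reproduces the 5 GByte/sec port figure and applies equation (1) for a
given arity and utilisation factor.

```python
def port_bandwidth_bytes(width_bits, data_rate_hz):
    """Port bandwidth in bytes per second for a data path of
    width_bits wires, each clocked at data_rate_hz."""
    return width_bits * data_rate_hz / 8


def router_bandwidth(alpha, arity, bw_port):
    """Aggregate router bandwidth following equation (1):
    utilisation factor times arity times bandwidth per port."""
    if not 0 < alpha <= 1:
        raise ValueError("utilisation factor must be in (0, 1]")
    return alpha * arity * bw_port


# A 32 bit port at 1.25 GHz carries 40 Gb/s, that is 5 GByte/sec.
bw_port = port_bandwidth_bytes(32, 1.25e9)

# Aggregate bandwidth of an arity-5 router with FIFO input queuing
# (alpha = 0.59) and with an ideal scheduler (alpha = 1).
bw_input_queued = router_bandwidth(0.59, 5, bw_port)
bw_ideal = router_bandwidth(1.0, 5, bw_port)
```

With these assumptions, input queuing loses roughly 40% of the aggregate
bandwidth compared with the ideal case, which is the gap that (virtual)
output queuing buys back at the price of larger queues.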

4.2 Cost

Three main components contribute to the area cost of a
router: the switch, the control logic, and the packet queues.
The switch allows N simultaneous connections from the
N inputs to the N outputs, which results in B arrays of N x
N wires, giving rise to an O(N^2) area cost.

The control logic of a router is made up of the switch-matrix
schedule unit and other configuration logic. The
delay of a schedule cycle varies greatly per algorithm
(for example, for virtual output queuing from O(1) to
O(N^2) [9]); it is important for two reasons. First, it
determines the lower bound for the latency that a flit (see
footnote 2) incurs to traverse the router. Second, it affects
the size of the queues. The longer a schedule cycle, the more
data arrive, given a fixed bandwidth of a port, BW_port. This
leads to deeper queues, and higher area cost.

The three aforementioned queuing strategies require
queues of size O(N) to O(N^2) flits. Scheduling algorithms
perform better with deeper queues, with a decreasing return.

Besides routers, a significant amount of area is consumed
by so-called network interface (NI) modules. These modules
translate the IP transactions for a given connection to
packets that are sent over the network, and vice versa. Packets
can be sent once the payload has been completely accepted
by the NI. Hence, the buffers must be dimensioned
such that at least a complete packet for every simultaneously
active connection can be stored.

The trade-off between utilisation α and the cost is a complex
one, but of importance to the viability of NoSs.

5 The future role of busses 

In Sections 1 and 2 we have argued that NoSs are essential
to solve SoS integration in a scalable fashion. While
Section 4.2 raised some general cost issues, we will now
more concretely consider the trade-off between busses
and NoSs. Will packet-switched NoSs completely replace
current busses in future SoSs, or will a hybrid approach
emerge? We believe that shared busses may have a role
to play in first-level communication (B in Figure 2) for the
following reasons.

First, typical IP blocks underutilize the bandwidth capacity
of an individual router port. All router ports offer the
same bandwidth that is inherent to the architecture, whereas
the bandwidth requirements of IP blocks vary greatly. A
shared memory module typically needs much higher (peak)
bandwidth than a streaming peripheral device. Single word
transfers, variable bit rates, bursty IO, and much lower clock
rates for IP blocks than for the NoS further waste bandwidth.
This means that the communication needs of a number
of IP blocks can be aggregated using a bus before the
capacity of a network link is reached.

Second, network interfaces are more expensive (in terms
of area) than a bus adaptor. Using a bus as a first-level traffic
concentrator, trading bus adaptors for network interfaces,
thus reduces the overall cost of IP-NoS interfacing. We
expect that the overhead of a bus and its network interface is
outweighed.

2 Flit stands for flow control digit, the atomic portion of data handled
per schedule cycle. A packet is decomposed in flits.

Figure 3. A shared-medium bus seems a cost-effective
way to connect the IP to the packet-switched
network.

Finally, the number of routers is reduced significantly
when busses are used as the first-level interconnect. Routers
are larger than busses due to their packet queues and more
complex scheduling. We give an example below.

An example of the heterogeneous communication architecture
is depicted in Figure 3. A router of arity three surrounded
by twelve IP blocks is shown. Two shared-medium
busses, each connected to six 50k-gate IP blocks, communicate
with the router via two network interfaces. These
have two functions: first they schedule the transactions on
the bus, and second they give the bus clients access to the
packet-switched network. The third port of the router provides
communication to the remainder of the network. Figure
4 shows an architecture using only routers. Now three
routers of arity five and one of arity four are needed.

The suggested shared-medium bus has a length of 35kλ,
where λ is half of the length of a minimal transistor. Global
wires of this length will not be the bottleneck of bus
performance (see footnote 3).

The feasibility of hybrid NoSs hinges on the right implementation
of the busses. First, they must be shared wires,
as opposed to switches. Second, their arbitration must be
combined, or at least compatible with, the scheduling taking
place in the network interfaces, to offer uniform end-to-end
network services.

We see a future for hybrid NoSs, with first-level communication
over a shared-medium bus, and the higher levels using
a packet-switched network. Perhaps a packet-switched
network can be seen as a distributed and scalable implementation
of a logical bridge that connects all the local busses of
the SoS. Deciding how many IP blocks can use a local bus
before connecting to the router network is a question that
must be answered foremost.

3 Minimum-delay wire segments have a length of 23kλ, wire segments
optimized for power-delay product have a length of 48kλ. These lengths
scale with technology like the edge of 50k blocks [4].

Figure 4. IP to IP communication based on a
homogeneous router network.

6 Conclusion 

We have argued in Section 1 that future systems on silicon
(SoS) will be composed of large numbers of processing
nodes (or IP blocks). Each processing node is relatively
small (50k gates) to scale with technology, and can
be handled by CAD tools, assuming their evolutionary
improvement. The interconnect and communication between
these blocks then becomes an essential function in itself
(Section 2), leading to networks on silicon (NoS). A NoS
is based on packet switching to flexibly share link capacity
between the network clients, and to provide pluriform
communication services over a uniform infrastructure. Both
efficiency, provided by best-effort traffic, and predictable
performance, such as guaranteed throughput and latency, are
important (Section 3). Efficiently combining them is a
challenge. Section 4 showed that the performance of a NoS
depends on many factors, but is expected to be high. The cost
of a NoS can be stated in terms of area (routers, network
interfaces), utilization of wires, and speed (latency). They can
be traded off against one another, but also, perhaps more
interestingly, against the cost of busses. A hybrid NoS using
shared-wire busses to communicate locally, and accumulating
traffic for a core router network, is a promising architecture
that deserves to be investigated.

ABSTRACT

Managing the complexity of designing chips containing billions of
transistors requires decoupling computation from communication.
For the communication, scalable and compositional interconnects
(such as networks on chip (NoC)) must be used. In this paper we
show that guaranteed services are essential in achieving this
decoupling. Guarantees typically come at the cost of inefficient
resource utilization. To achieve efficiency, they must be used in
combination with best-effort services. We describe a NoC architecture
that efficiently combines guaranteed and best-effort services. The
key element of our NoC is a router consisting conceptually of two
parts: the so-called guaranteed throughput (GT) and best-effort (BE)
routers. Both offer data integrity, lossless and in-order data
delivery. Additionally, the GT router offers guaranteed throughput and
latency services. We combine the GT and BE router architectures
efficiently by sharing router resources, enabling high link utilization.
The guarantees are never affected by the BE traffic, and links are
efficiently utilized because BE traffic uses all bandwidth left over
from GT traffic. Connections are programmed using BE packets.
The programming model is robust, concurrent, and distributed. It
enables run-time and compile-time, deterministic and adaptive
connection management. For all our architectural choices, we show the
trade-offs between hardware complexity and efficiency, and motivate
our choices.

1. INTRODUCTION

Recent advances in technology raise the challenge of managing
the complexity of designing chips containing billions of transistors.
A key ingredient in tackling this challenge is decoupling the
computation from communication [9, 15]. This decoupling allows IPs
(the computation part), and the interconnect (the communication
part) to be designed independently from each other.

In this paper, we focus on the communication part. Existing
interconnects (e.g., busses) may no longer be feasible for chips with
many IPs, because of the diverse and dynamic communication
requirements. Networks on a chip (NoC) are emerging as an
alternative to existing on-chip interconnects because they (a) structure
and manage global wires in new deep-submicron technologies [2,
3, 4, 6], (b) share wires, lowering their number and increasing their
utilization [4, 6], (c) can be energy efficient and reliable [2], and
(d) are scalable when compared to traditional busses [7].

Decoupling the computation from communication requires that
services that IPs use to communicate (a) are well-defined, and
(b) hide the implementation details of the interconnect [9], see
Figure 1(a). NoCs again help, because they are traditionally
designed using layered protocol stacks [14], where each layer
provides a well-defined interface which decouples service usage from
service implementation [15, 3].

In particular, guaranteed services are essential because they
make the requirements on the NoC explicit, thus limiting the
possible interactions (a stricter contract) of IPs with the communication
environment. As a result, IP design is simpler. IPs can also be
designed independently, because their guaranteed services are not
affected by the interconnect or by other IPs. This is essential for a
compositional construction (design and programming) of systems
on chip. Moreover, for guaranteed services, failures are restricted
to the IP configuration phase (a service request is either granted or
denied by the NoC), which simplifies the IP programming model [6].
We view the guaranteed services to be offered by an interconnect
as a requirement from the applications, see Figure 1(b).

The drawback of using guaranteed services is that they require
resource reservation for worst-case scenarios. As a consequence,
resources may not be efficiently utilized, which may not be
acceptable in a system on a chip where cost constraints are typically
very tight; see Figure 1(c). To overcome this problem, best-effort
services can be used for less critical communication requirements
to fully utilize the available resources. Using best-effort services,
however, provides no guarantees.

A compromise between using guarantees only and having an
efficient interconnect is to combine guaranteed and best-effort
services. Guaranteed traffic should not be affected by best-effort
traffic, while best-effort traffic may use all the resources not used by
guaranteed traffic. Guaranteed services would then be used for the
critical traffic requirements, and best-effort services for non-critical
traffic requirements.

Figure 1: Network services (a) hide the interconnect details and
allow reusable components to be built on top of them, (b) are
driven by the application requirements, (c) their efficiency
relies on technology and network organization, and (d) are built
using a layered approach.

In this paper, we first list a set of network-independent
communication services that are essential in chip design. In the following
sections, we show the trade-offs between efficiency and cost that
we make in our NoC. In Section 3, we present the trade-offs and
take decisions on network-related issues. In Section 4, we zoom
into the internals of the key component of our NoC: a router which
efficiently combines guaranteed and best-effort services.

2. SERVICES 

The increasing complexity of integrated circuits, and the strong
time-to-market pressure require modular designs and IP reuse.
Decoupling computation from communication in chip design serves
both these requirements [9]. This decoupling is realized by
defining communication interfaces that provide well-defined
services and hide the implementation details of the interconnect.

We show in Section 1 that guaranteed services are essential to
simplify IP design and integration. Examples of such guaranteed
services are data integrity, which means the data is delivered
uncorrupted, lossless data delivery, which means no data is dropped
in the interconnect, and in-order data delivery, which specifies that
the order in which data is delivered is the same order in which it
has been sent. Other guarantees offer time-related bounds, such as
throughput and latency.

Guarantees require resource reservation for worst-case scenarios,
which can be expensive. For example, guaranteeing throughput
for a stream of data implies reserving bandwidth for its peak
throughput, even when its average is much lower. As a consequence,
when using guarantees, resources are often underutilized.

Resources are better utilized when best-effort traffic is used.
Best-effort services do not reserve any resources, and hence provide
no guarantees. As a consequence, their performance is dictated by
boundary conditions, such as interconnect load. For example, a
connection may become temporarily lossy in a congested network,
if the network resolves congestion by dropping data.

Best-effort services use resources well because they are typically
designed for average-case scenarios as opposed to worst-case
scenarios. They are also easy and fast to use, as they require no
resource reservation. Their main disadvantage is their unpredictability:
one cannot rely on a given performance (they do not offer
guarantees). In the best case, if certain boundary conditions are
assumed, a statistical performance can be derived.

The requirements for guaranteed services and the efficiency
constraint (good resource utilization) are conflicting. But a first step
to a predictable and low-cost interconnect is combining the
guaranteed and best-effort services in the same interconnect.
Guaranteed services would be used for critical traffic requirements, and
best-effort services for non-critical traffic requirements. For example,
a video processing IP will typically require a lossless, in-order
video stream with guaranteed throughput, but possibly allows
corrupted samples. Another example is cache updates, which require
uncorrupted, lossless, low-latency data transfer, but ordering and
guaranteed throughput are less important. In Section 4.3 we show
how combining guaranteed and best-effort services efficiently uses
common resources. In the remainder of this section we analyze the
minimum level of abstraction at which the communication services
must be offered to hide the network internals.

Traditionally, network services have been implemented and offered
using a layered protocol stack, typically aligned to the ISO-OSI
reference model [14]. NoCs also take this approach [2, 3, 6, 15],
because it structures and decomposes the service implementation,
and the protocol stack concepts aid the positioning of services.

To achieve the decoupling of computation from communication,
the communication services must be offered at least at the level
of the transport layer in the OSI reference model. It is the first layer
that offers end-to-end services, hiding the network details; see
Figure 1(d) [3].

The lowest three layers in the protocol stack, namely the physical,
data-link and network layers, are network specific. Therefore, these
services should not be visible to the IPs if decoupling of
computation from communication is desired. However, these layers are
essential in implementing the services, because constructing
guarantees without guarantees at the layer below is either very
expensive, or even impossible. For example, implementing lossless
communication on top of a lossy service requires acknowledgment,
data retransmission, and filtering of duplicated data. This leads to a
significant increase in traffic, and also to a trade-off between large
buffer space requirements and long delays. Even worse, providing
guarantees for time-related services is impossible if lower layers
do not offer these guarantees. For example, throughput cannot be
guaranteed if communication at a lower layer is lossy. As a
consequence, guarantees can only be built on top of guarantees, see
Figure 1(b). Similarly, a layer's efficiency is based on efficient
implementations of the layers below it, see Figure 1(c).

The NoC services that we consider essential for chip design are:
data integrity, lossless data delivery, in-order delivery, throughput
and latency. Data integrity is always guaranteed. All the other
services can be guaranteed or not, depending on request. In the
next section, we describe briefly how these services are provided
by our NoC, and in Section 4 we describe in detail how our router
architecture enables an efficient implementation of these services.

3. NETWORKS ON CHIP

Currently, the prevalent on-chip interconnects are busses and
switches [10]. These are single-hop interconnects, meaning that
there is no storage in the interconnect itself. Scalable interconnects
require multiple hops with storage in every hop (router). This
introduces a number of new issues, which we discuss in this section.

General computer network research is a mature research
field [16] which has many issues in common with NoCs. However,
two significant differences between computer networks and
on-chip networks make the trade-offs in their design very
different [4]. First, routers of a NoC are more resource constrained than
those in computer networks, in particular in the control complexity
and in the amount of memory. Second, communication links of a
NoC are relatively shorter than those in computer networks, allowing
tight synchronization (network flow control) between routers.

These two characteristics have a direct impact on the NoC service
implementation. In a NoC, it is possible to solve the data
integrity at the data-link layer at a low cost. We, therefore, assume it
solved at the network layer and higher. Lossless transport of data
is guaranteed by our routers. However, to allow consumers slower
than producers, the network may be allowed to drop data at its edge.
Consequently, the designer may choose either (a) a lossless
connection (i.e., implementing end-to-end flow control), or (b) a lossy
connection (i.e., without flow control). In-order delivery is again
guaranteed by our router (i.e., routers do not reorder data between
a given input port and a given output port). End-to-end ordering
of data, however, has to be provided on top of this at the network
edge when data is transported on different routes with different
delays. Offering guaranteed and best-effort throughput and latency
services is also implemented by the routers. These router services
together with the programming model explained in Section 4.3.2
offer network throughput and latency services.

We identify four important issues in the design of the router
network architecture. These are: the switching mode, routing,
contention resolution, and network flow control. Equally important,
end-to-end flow control and congestion control are handled in our
NoC at the network edge instead of the routers; we therefore omit
their discussion here.

3.1 Switching Mode 

The switching mode of a network specifies how data and control
are related. We distinguish circuit switching and packet switching.

In circuit switching, data and control are separated. First the
control is provided to the network (connection set up). This results in
a circuit over which all subsequent data of the connection is
transported. In time-division switching, bandwidth is shared by
time-division multiplexing connections over circuits. Circuit-switched
networks inherently offer time-related guaranteed services when
resources are reserved during the connection set up.

In packet switching, data is divided into packets and every packet
is composed of a control part (the header) and a data part (the
payload). Network routers inspect, and possibly modify, the headers
of incoming packets to switch the packet to the appropriate
output port. Since in packet switching the packets are self-contained,
there is no need for a set-up phase to allocate resources. Best-effort
services are therefore naturally provided by packet switching.

3.2 Routing 

Routing is the determination of the route (or path) that the data
follows from source to destination. There are two basic approaches:
source routing and destination routing. In source routing, the
network interface at the source computes the complete route to the
destination. In destination routing, only the network address of
the destination is specified, and every router selects the appropriate
output based on the address. We refer to [17] for several classes of
routing functions.

In circuit switching, routing takes place at connection set up, i.e.,
once for all data in that connection. In packet switching, routing is
done for every individual packet sent over the network. In both
cases, source and destination routing are possible. We currently
consider source routing because it is independent of the router
network topology, which is not yet determined.

3.3 Contention Resolution

When a router attempts to send multiple data items over the same
link at the same time, contention is said to occur. As only one data
item can occupy a link at any point in time, a selection among the
contending data must be made; this process is called contention
resolution. Three approaches exist: avoiding contention, dropping
data (one of the contending data items is transmitted and the
remainder are deleted), and scheduling (or sequentializing) data (all data
items are sent in turn; some data items are therefore delayed).

In circuit switching, contention resolution takes place at set up
at the granularity of connections, so that data sent over different
connections do not conflict. Thus, there is no contention during
data transport, and time-related guarantees can be given.

In packet switching, contention resolution takes place at the
granularity of individual packets. Dropping packets is possible, but for
a lossless service (a) it adds complexity to the network
(acknowledgments, retransmission, etc.), and (b) it ultimately increases the
traffic because dropped packets need to be resent. Thus, scheduling
data is the only remaining option.

3.4 Network Flow Control

Network flow control, also called routing mode, deals with the
limited amount of buffering in routers and data acceptance between
routers. In circuit switching, connections are set up. The data sent
over these connections is always accepted by the routers and hence
no network flow control is needed. In packet switching, data must
be buffered at every router before it is sent on. Because routers
have a limited amount of buffering, they accept data only when they
have enough space to store the incoming data.

There are three types of flow control, namely store-and-forward,
virtual cut-through, and wormhole routing. In store-and-forward
routing, an input packet is received and stored in its entirety before
it is forwarded to the next router. This requires storage for the
complete packet, and implies a per-router latency of at least the time
required for the router to receive the packet.

In virtual cut-through routing, a packet is forwarded as soon as
the next router guarantees that the complete packet will be
accepted. Only when no guarantee is given is the whole packet stored
in the router. Thus, virtual cut-through routing requires buffer space
for a complete packet, like store-and-forward routing, but allows
lower-latency communication.

In wormhole routing, packets are split into so-called flits (flow
control digits). A flit is passed to the next router when that router
accepts that flit, even when there is not enough buffer space for the
complete packet. As soon as a flit of a packet is sent over an output
port, that output port is reserved for flits of that packet only. When
the first flit of a packet is blocked, the trailing flits can therefore
be spread over multiple routers, blocking the intermediate links.
Wormhole routing requires the least buffering (it buffers flits instead
of packets) and also allows low-latency communication. However,
it is more sensitive to deadlock and generally results in lower link
utilization than virtual cut-through routing.

We opt for wormhole routing because it offers low latency, which
is one of our targeted services, and because it has the lowest cost in
terms of buffering, which is expensive on-chip.
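
The buffering and latency differences between the three schemes can be
illustrated with a simple zero-load model. The sketch below is only an
illustration under stated assumptions, not a model taken from this paper:
one flit crosses one link per cycle, there is no contention, and all
function and parameter names are hypothetical.

```python
def store_and_forward_latency(hops, packet_flits):
    """Each router receives the whole packet before forwarding it."""
    return hops * packet_flits


def cut_through_latency(hops, packet_flits):
    """Virtual cut-through and wormhole at zero load (assumed model): the
    head flit moves one hop per cycle, the other flits follow in a pipeline."""
    return hops + (packet_flits - 1)


def min_buffer_flits(scheme, packet_flits):
    """Minimum buffer space per router input, expressed in flits."""
    if scheme in ("store_and_forward", "virtual_cut_through"):
        return packet_flits  # must be able to hold a complete packet
    if scheme == "wormhole":
        return 1             # buffers flits instead of packets
    raise ValueError(scheme)
```

For a packet of 8 flits crossing 4 routers, this model gives 32 cycles for
store-and-forward against 11 cycles for the two cut-through variants, while
only wormhole routing gets by with a single flit of buffering per input.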

4. A COMBINED GT-BE ROUTER

Section 2 defines our requirements for NoCs in terms of services
that are to be offered, in particular both guaranteed and best-effort
services. The previous section introduces a number of general
networking issues that will be built upon here. In the following two
subsections we show that the guaranteed and best-effort services
can conceptually be described by two independent router
architectures. The combination of these two router architectures is
efficient and has a flexible programming model, as described in
Subsection 4.3.

4.1 A GT Router Architecture

Our guaranteed-throughput (GT) router must guarantee uncorrupted,
lossless and ordered data transfer, and both throughput and
latency over a finite time interval. As mentioned earlier, data
integrity is solved at the data-link layer; we do not address it further.
No data is dropped by the GT router because we use a variant of
circuit switching (described in the next section). Data is transported
in fixed-size blocks, further explained below. As only one block
is stored per input in the GT router, blocks remain ordered. We
now turn to the more challenging time-related guarantees, namely
throughput and latency.

4.1.1 Time-related Guarantees

Latency is defined as the time a packet spends in the network.
Guaranteeing latency, therefore, means that a worst-case upper
bound must be given for this time. Here we define throughput
for a given producer-consumer pair as the amount of data
transported by the network over a finite, fixed time interval.
Guaranteeing throughput means giving a lower bound.

We observe that guaranteeing latency in a lossless router is
difficult because contention requires scheduling and hence delays.
Guaranteeing throughput is less problematic. Rate-based packet
switching (for an overview see [18]) offers guaranteed throughput
over a finite period, and hence a latency bound. This bound is
high, however, and the cost of buffering is also high.
Deadline-based packet switching [13] offers preferential treatment for
packets close to their deadline. This allows differential latency
guarantees (under certain admissible traffic assumptions), but also at high
buffer costs.

Circuit switching solves the contention at set up, so naturally
providing guaranteed latency and throughput. Circuits can be
pipelined to improve throughput [5], at the cost of additional
buffering and latency. Time-division multiplexing connections over
pipelined circuits additionally offers flexibility in bandwidth
allocation. This requires a notion of router synchronicity, which is
possible because a NoC is better controllable than a general network.
We explain this variation in more detail in the next subsection. The
associated programming model is described in Section 4.3.2.

4.1.2 Contention-free Routing 

A router uses a slot table to (a) avoid contention on a link, (b)
divide up bandwidth per link, and (c) switch data to the correct
output. Every slot table R has S fixed-size time slots (rows), and N
router outputs (columns). There is a logical notion of synchronicity:
all routers in the network are in the same slot. In a slot s at
most one block of data can be read/written per input/output port. In the
next slot (s + 1)%S, the read blocks are written to their appropriate
output ports. Blocks thus propagate in a store-and-forward fashion.
The latency a block incurs per router is equal to the duration of a
slot. Bandwidth is guaranteed in multiples of block size per S slots.

The entries of the slot table map outputs to inputs for every slot:
R(s, o) = i. An entry is empty when there is no reservation for
that output in that slot. No contention arises because there is at most
one input per output. Sending a single input to multiple outputs
(multicast) is possible.
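
As an illustration of the slot table described above, the following
minimal sketch models R as an S-by-N array whose entry R[s][o] holds the
input i reserved for output o in slot s. The class and method names are
hypothetical and chosen for this example only; the sketch is not the
hardware implementation.

```python
class SlotTable:
    """Slot table R of one router: R[s][o] = input reserved for output o in slot s."""

    def __init__(self, num_slots, num_outputs):
        self.S = num_slots
        self.N = num_outputs
        self.R = [[None] * num_outputs for _ in range(num_slots)]

    def reserve(self, slot, output, inp):
        """Reserve `output` in `slot` for `inp`; refuse if the entry is taken."""
        if self.R[slot][output] is not None:
            return False
        self.R[slot][output] = inp
        return True

    def switch(self, slot, blocks_in):
        """Switch the blocks read in `slot` to their outputs.

        blocks_in maps input port -> block; the result maps output port -> block.
        At most one input is stored per output, so no contention can arise.
        """
        out = {}
        for o in range(self.N):
            i = self.R[slot][o]
            if i is not None and i in blocks_in:
                out[o] = blocks_in[i]
        return out
```

Because an entry maps an output to an input, reserving two outputs for
the same input in one slot is allowed and yields multicast, whereas a
second reservation of the same output in the same slot is refused.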

The slots reserved for a block along its path from source to
destination increase by one (modulo S). If slot s is reserved in a router,
slot (s + 1)%S must be reserved in the next router on the path.
The assignment of slots to connections in the network is an
optimization problem, and is described in Section 4.3.3. Section 4.3.2
explains how slots are reserved in the network by means of
best-effort packets.

4.2 A BE Router Architecture

Best-effort (BE) traffic can have a better average performance
than offered by guaranteed services. This depends on boundary
conditions, such as network load, that are unpredictable.
Best-effort services thus fulfill our efficiency requirement, but without
offering time-related guarantees. This section describes an
architecture for a best-effort service with uncorrupted, lossless, in-order
data transport.

The router efficiency is influenced by both its complexity and
its utilization. In Section 3 we have justified our choice for
routing (source routing) and network flow control (wormhole). Now
we determine the contention resolution scheme that is used. It has
two components: buffering and scheduling. Our router prototypes
show that the buffering costs dominate the cost of the router. The
main trade-off in Section 4.2.1 is therefore between buffer costs and
link utilization, which are both critical resources. For the chosen



buffering strategy an efficient 
Section trading off link Utilization 

4.2.1 Buffering Strategy

The buffering strategy determines the location of buffers inside
the router. We distinguish input queuing, output queuing, and
virtual output queuing. In the following, N is the number of inputs
(equal to the number of outputs) of the router. We believe that in
a balanced solution the rates at which routers and links operate are
equal. Slower routers require more buffering, and faster routers are
not feasible as links operate at high speed.

In input queuing there is a single queue per input, resulting in
the lowest buffer cost (N logical queues in N physical memories)
of all three approaches. However, due to the so-called
head-of-line blocking, for large N network utilization saturates at about
58.6% [8]. Therefore, input queuing results in weak utilization of the links.

Output queuing can increase the link utilization to 100% by having
N queues at each output, or N^2 queues, with as many physical
memories. It is better to have fewer, larger memories than more,
smaller memories because the overhead of small RAMs is very
high. Overclocking the router by a factor N to use N memories
is not possible, as argued previously. So the number of memories
depends quadratically on N, hence output queuing is not scalable.

Virtual output queuing [1] (VOQ) combines the advantages of
input queuing and output queuing. It has the buffering complexity
of input queuing and the link utilization of output queuing. As for
output queuing, there are N^2 logical queues, but they are combined
in N physical memories at the inputs as for input queuing. For
every input i, there are N queues Q(i, o), one for each output o; see
Figure 2. There is at most one write to these queues. The difference
between output queuing and VOQ is the additional constraint that there can
be at most one read from this group of N queues. (This enables the
mapping of all input queues of the same input to one memory.) This
additional constraint has to be taken into account by the scheduling.
100% link utilization can still be achieved when N is large [12].

We select VOQ because it combines high link utilization with
moderate buffer costs.

4.2.2 Matrix Scheduling

This section shows how link contention and memory contention
(imposed by VOQ) are resolved. Matrix scheduling solves both
kinds of contention by ensuring that every VOQ memory is read at
most once, and every output (link) is written to at most once. The
scheduling problem can be modeled as a bipartite graph matching
problem as follows. Every input port i is modeled by a node u_i and
every output port o by a node v_o. There is an edge between u_i and
v_o if and only if queue Q(i, o) is non-empty. A match is a subset
of these edges such that every node is incident to at most one edge.
For example, Figure 3(c) is a match of Figure 3(a). The number of
edges in the match is its size; a match is maximal when no edges
can be added to it. A maximum size match is a largest size match.
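
The difference between a maximal and a maximum size match can be made
concrete with a short sketch. Edges are written as (i, o) pairs standing
for non-empty queues Q(i, o); the function names are hypothetical and
serve this illustration only.

```python
def is_match(edges):
    """A match uses every input node and every output node at most once."""
    inputs = [i for i, _ in edges]
    outputs = [o for _, o in edges]
    return len(set(inputs)) == len(inputs) and len(set(outputs)) == len(outputs)


def is_maximal(match, graph):
    """A match is maximal when no remaining edge of the graph can be added."""
    used_in = {i for i, _ in match}
    used_out = {o for _, o in match}
    return all(i in used_in or o in used_out for i, o in graph)
```

For the graph {(0, 0), (0, 1), (1, 0)}, the single edge (0, 0) is already a
maximal match, because both remaining edges share a node with it. The
maximum size match, however, is {(0, 1), (1, 0)} with two edges.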

Although optimal, there are two reasons not to consider only
maximum size matches. First, maximum size matching algorithms
have O(N^(5/2)) complexity. Since matrix scheduling is done at flit
rate, this is not feasible for large N. Second, maximum size
matching algorithms can be unfair, which can result in starvation, i.e.,
some queues are never served.

Figure 2: Schematic of a router using virtual output queuing.

Figure 3: The three steps of a single iSLIP iteration.

There are several matching algorithms; see [11] for a thorough
discussion. We select the iterative SLIP (iSLIP) matrix scheduling
algorithm [11], because it has a low complexity, avoids starvation,
and provides increasing performance as the number of iterations
grows. It reaches a maximal match in log2(N) iterations. Even a
single iteration considerably outperforms input queuing, and can be
efficiently implemented in hardware. Multiple iterations increase
the latency of the control path, and hence the flit size (as explained
in Section 4.3.1). We consider using 1-SLIP because multiple
iterations give only marginal improvement.

A single iSLIP iteration has three steps, illustrated by an example
in Figure 3 for N = 4. In the first stage, see Figure 3(a), every
non-empty queue Q(i, o) requests access to output port o from input
port i. In the second stage, see Figure 3(b), every output port o
grants one request, solving link contention at the output ports. In
the third stage, see Figure 3(c), every input port i accepts one grant,
to resolve memory contention at the input port. We extend iSLIP
to take network flow control into account.
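
The three stages can be sketched in software as follows. This is a
minimal illustration, not the hardware scheduler and not the extension
for network flow control. The text above does not spell out how an
output chooses among requests or how an input chooses among grants; the
sketch assumes the round-robin pointers commonly described for iSLIP,
which advance one position past the matched port only when a grant is
accepted. All names are hypothetical.

```python
def islip_iteration(nonempty, grant_ptr, accept_ptr):
    """One request-grant-accept iteration.

    nonempty[i][o] is True when queue Q(i, o) holds data.
    grant_ptr[o] and accept_ptr[i] are round-robin pointers (assumed rule).
    Returns the match as a dict {input: output} and updates the pointers.
    """
    n = len(nonempty)
    # Stage 1 (request): every non-empty Q(i, o) requests output o.
    # Stage 2 (grant): each output grants the first requesting input found
    # at or after its pointer.
    grants = {}  # input -> list of outputs that granted it
    for o in range(n):
        for k in range(n):
            i = (grant_ptr[o] + k) % n
            if nonempty[i][o]:
                grants.setdefault(i, []).append(o)
                break
    # Stage 3 (accept): each input accepts the first granting output found
    # at or after its pointer.
    match = {}
    for i, outs in grants.items():
        o = min(outs, key=lambda out: (out - accept_ptr[i]) % n)
        match[i] = o
        # Pointers move one past the matched port, for accepted grants only.
        grant_ptr[o] = (i + 1) % n
        accept_ptr[i] = (o + 1) % n
    return match
```

With two ports, queues Q(0, 0), Q(0, 1) and Q(1, 0) non-empty and all
pointers at zero, the first call matches only input 0 to output 0. The
pointer update then desynchronizes the outputs, so that a second call
on the same queue state matches both inputs.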

4.3 Combining the GT and BE Routers

The GT and BE router architectures are combined to share
resources, in particular the links, memories, and switches. Moreover,
best-effort traffic enables a packet-based programming model for
the guaranteed traffic, as shown later in Section 4.3.2.

The principal constraint for a combined router architecture is that
guaranteed services are never affected by best-effort services.
Figure 4(a) shows that, conceptually, the combined router contains
both router architectures (fat lines represent data transport, thin
lines represent control transport). Incoming data is switched to
either the GT or the BE router. The GT traffic (the traffic that is served
by the GT router) has the higher priority, to maintain guarantees.
This is ensured by the arbitration unit, which therefore affects the
best-effort scheduling. Furthermore, best-effort packets can
program the guaranteed router, as shown by the arrow labeled
program. Thin lines going from the right to the left indicate network
flow control, which is only required for best-effort packets because
guaranteed blocks never encounter contention.

On a shared link only one BE or GT data item can arrive or be
sent at any point in time. Thus GT and BE memories can be shared,
keeping the number of memories at N, with N + N^2 logical queues
in total. Figure 4(b) shows that the data path consisting of
memories and switch matrix is shared, and that the control paths of the BE
and GT routers are separate, yet interrelated. Moreover, the
arbitration unit of Figure 4(a) has been absorbed by the BE router. The
following subsection shows how this can be done.

4.3.1 Arbitration and Flit Size 

When combining GT and BE traffic in a single network, the impact
on the network flow control scheme must be taken into account.
Recall from Section 3.4 that a BE flit is the smallest unit at
which flow control is performed. In other words, the BE scheduling,
using iSLIP, can only react to GT blocks at flit granularity. To avoid
alignment problems, the block size (B words) is a multiple of the
flit size (F words, with B = CF). C is constant; we prefer a small C and
F to decrease the store-and-forward delay for guaranteed traffic.

(a) conceptual view (b) hardware view

Figure 4: Two views of the combined GT-BE router.

We extend iSLIP to handle the combination of GT and BE traffic.
In this combination GT traffic always has priority over BE traffic.
This is to ensure that guarantees are never corrupted.
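
One way to picture this priority rule is to remove, before BE scheduling,
every request for an output that a GT block uses in the current slot. The
sketch below is an assumption made for illustration: the paper does not
detail the extension, the function name is hypothetical, and the possible
contention on the shared input memories is deliberately left out.

```python
def be_requests_after_gt(nonempty, gt_outputs):
    """Mask the BE request matrix with the GT reservations of this slot.

    nonempty[i][o] is True when BE queue Q(i, o) holds a flit.
    gt_outputs is the set of output ports used by GT blocks in this slot.
    Only outputs are masked here (a simplification of the shared data path).
    """
    n = len(nonempty)
    return [
        [nonempty[i][o] and o not in gt_outputs for o in range(n)]
        for i in range(n)
    ]
```

The masked matrix can then be handed to the BE scheduler, so that BE
flits only compete for the outputs left unused by GT traffic.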

4.3.2 Programming Model

In this section we show how GT connections are set up and torn
down by means of BE packets. To ensure scalability, programming
must not require global or centralized resources. Section 4.1.2
explains why our contention-free routing uses slot tables; we now see
that they are distributed over routers for scalability.

Initially the slot table of every router is empty. This means that
GT connections can only be set up using BE packets, unless an
additional communication infrastructure is introduced solely for
programming. Two special packets, Reset and Start, are used to reset
and start the NoC, respectively. They progress by flooding, and are
not subject to the usual network flow control. We will not discuss
them further. There are three system packets: SetUp, TearDown,
and AckSetUp. They are used to program the slot table in every
router on their path.

The SetUp packet is used to create a connection from a source
to a destination, and travels in the direction of the data
("downstream"). AckSetUp acknowledges a successful set up, and flows
upstream. The TearDown packet destroys (partially) existing
connections, and can travel in either direction. SetUp packets contain
the source of the data, the path to their destination, and a slot
number. Every router along the path of the SetUp packet checks if the
output to the next router in the path is free in the slot indicated by
the packet. If it is free, the output is reserved in that slot, and the
SetUp packet is forwarded with an incremented (modulo S) slot.
Otherwise, the SetUp packet is discarded and a TearDown packet
returns along the same path. Thus every path must be reversible;
this is the only assumption we make about the network topology.
These upstream TearDown packets free the slot, and continue with
a decremented slot. Downstream TearDown packets work similarly,
and remove existing connections. A connection is successfully
created when an AckSetUp is received, else a TearDown is received.
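
The set-up procedure can be sketched as a walk along the path. The code
below is a simplified, sequential illustration under stated assumptions:
packets are not modeled, each table entry stores a connection label
rather than an input port, and the upstream TearDown is represented by
undoing the reservations made so far. All names are hypothetical.

```python
def set_up(tables, path, start_slot, source):
    """Walk a SetUp packet along `path`.

    tables: one slot table per router, tables[r][slot][output] = label or None.
    path:   list of (router, output) pairs from source to destination.
    Returns True (AckSetUp) or False (TearDown after rolling back).
    """
    num_slots = len(tables[path[0][0]])
    slot = start_slot
    reserved = []
    for router, output in path:
        if tables[router][slot][output] is not None:
            # Upstream TearDown: free the slots reserved so far, in reverse.
            for r, s, o in reversed(reserved):
                tables[r][s][o] = None
            return False
        tables[router][slot][output] = source
        reserved.append((router, slot, output))
        slot = (slot + 1) % num_slots  # the next router uses the next slot
    return True
```

A failed set up leaves the slot tables exactly as they were before the
attempt, which is what the returning TearDown packet achieves in the
network.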

The programming model is pipelined and concurrent (multiple
system packets can be active in the network simultaneously, also
from the same source) and distributed (active in multiple routers).
Given the distributed nature of the programming model, ensuring
consistency and determinism is crucial. The outcome of
programming may depend on the execution order of system packets, but is
always consistent. The next section shows how to use the
programming model.






4.3.3 Slot Allocation

This section explains ways to determine the slots specified in
SetUp packets. A slot allocation for a single connection requires
that, in every router along the path, the required output is free in the
appropriate slot. Therefore, interference of SetUp packets of
multiple connections can be completely avoided if connections are set
up with conflict-free slots or paths. All execution orders of SetUp
packets then give the same result.

Computing an optimal slot allocation is complex and requires a
global network view. It can be used only for small problem
instances. To reduce computational cost, heuristics can be used, but
this probably leads to non-optimal solutions. Compile-time slot
allocations from both approaches can be recreated deterministically
at run time, concurrently and distributedly (because all SetUp
packets are conflict-free).

At run time, a global view requires a centralized slot allocation.
This impairs scalability and slows down programming. Run-time
distributed slot allocation is scalable, but lacks a global view. This
typically results in suboptimal slot allocation. Moreover, SetUp
packets may interfere, making programming more involved, and
perhaps non-deterministic. However, dynamic connection
management at high rates will require distributed slot allocation. In
a simple distributed greedy algorithm, all sources repeatedly
generate random slot numbers for each set up until their connection
succeeds.
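
The greedy algorithm can be sketched as follows. This is an illustration
under stated assumptions rather than the distributed protocol itself:
the check and the reservation are collapsed into one sequential step
instead of being carried out router by router with SetUp packets, the
retry limit max_tries is an addition for this example, and all names are
hypothetical.

```python
import random


def path_is_free(tables, path, start_slot):
    """True when every hop of `path` is free in its slot, starting at start_slot."""
    num_slots = len(tables[0])
    return all(tables[r][(start_slot + k) % num_slots][o] is None
               for k, (r, o) in enumerate(path))


def greedy_allocate(tables, path, conn, rng, max_tries=100):
    """Draw random start slots until a set up succeeds; return None on failure."""
    num_slots = len(tables[0])
    for _ in range(max_tries):
        start = rng.randrange(num_slots)
        if path_is_free(tables, path, start):
            for k, (r, o) in enumerate(path):
                tables[r][(start + k) % num_slots][o] = conn
            return start
    return None
```

With a single router, one output and two slots, two connections obtain
the two distinct slots, and a third request fails once the table is full.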



We conclude that our programming model allows both compile-time
and run-time slot allocation. Computational complexity,
deterministic results, and scalability can be balanced according to
system requirements.

5. CONCLUSIONS 

Managing the complexity of designing chips containing billions
of transistors requires decoupling computation from communication.
For communication, networks on chip (NoC) are emerging
as an alternative for existing interconnects to solve technological,
performance, and scalability problems.

In this paper we show that guaranteed services are essential to
provide predictable interconnects that enable compositional system
design and integration. However, guarantees typically utilize
resources inefficiently. Best-effort services overcome this problem
but provide no guarantees. So, combining guaranteed and
best-effort services allows efficient resource utilization, while still
providing guarantees for critical traffic.

Time-related guarantees, such as throughput and latency, can
only be constructed on a NoC that intrinsically has these properties.
We therefore define a router-based NoC architecture that combines
guaranteed and best-effort services. Thus, the router architecture
has conceptually two parts: the guaranteed-throughput (GT) and
best-effort (BE) routers. Both offer data integrity, lossless data
delivery, and in-order data delivery. Additionally, the GT router offers
guaranteed throughput and latency services using pipelined circuit
switching with time-division multiplexing. This requires a notion
of synchronicity: at each time slot at most one block of data is
communicated over a link. The GT router has low latency and
moderate memory requirements. The BE router uses packet switching,
wormhole routing, and virtual output queuing with iSLIP. The BE
router has low latency, high link utilization, and moderate memory
requirements.

We combine the GT and BE router architectures efficiently by
sharing router resources. The guarantees are never affected by the
BE traffic, and links are efficiently utilized because BE traffic uses
all bandwidth left over by GT traffic. Connections are programmed

1. Integrated circuit comprising a plurality of modules, and a network arranged
for transferring messages between the modules, wherein a message issued by a 
module comprises first information indicative for a location of an addressed module 
within the network, and second information indicative for a location within the 
addressed module, 

characterized in that the first and the second information are arranged as a single 
address from which the network determines which module is addressed, and from 
which the addressed module determines which of its locations is selected. 
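
The single-address arrangement of claim 1 can be sketched as follows. The
claim does not prescribe an encoding; this sketch assumes, purely for
illustration, that the upper bits of the address select the module and the
lower LOCAL_BITS bits select a location inside it. LOCAL_BITS and the
function names are hypothetical.

```python
LOCAL_BITS = 24  # hypothetical width of the module-internal part of the address


def make_address(module_id: int, local_offset: int) -> int:
    """Combine the first and the second information into one address."""
    if module_id < 0:
        raise ValueError("module_id must be non-negative")
    if not 0 <= local_offset < (1 << LOCAL_BITS):
        raise ValueError("local_offset does not fit in LOCAL_BITS")
    return (module_id << LOCAL_BITS) | local_offset


def network_decode(address: int) -> int:
    """What the network looks at: which module is addressed."""
    return address >> LOCAL_BITS


def module_decode(address: int) -> int:
    """What the addressed module looks at: which of its locations is selected."""
    return address & ((1 << LOCAL_BITS) - 1)


addr = make_address(3, 0x10)
target_module = network_decode(addr)   # 3
location = module_decode(addr)         # 0x10
```

The issuing module only ever handles `addr`; it does not need to know how
the network splits it, which is the simplification the claim aims at.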

2. Method for exchanging messages in an integrated circuit comprising a 
plurality of modules, the messages between the modules being exchanged via a 
network, wherein a message issued by a module comprises first information indicative 
for a location of an addressed module within the network, and second information 
indicative for a location within the addressed module, 

characterized in that the first and the second information are arranged as a single 
address from which the network determines which module is addressed, and from 
which the addressed module determines which of its locations is selected. 

3. Integrated circuit comprising a plurality of processing modules and a network
arranged for providing at least one communication channel between a first and a
second module, which communication channel supports transactions comprising outgoing
messages from the first module to the second module and return messages from the 
second module to the first module, characterized in that the network manages the 
outgoing messages in a way different from the return messages. 

4. Method for exchanging messages in an integrated circuit comprising a 
plurality of modules, the messages between the modules being exchanged via a 
network, wherein a communication channel through the network supports transactions 
comprising outgoing messages from the first module to the second module and return 
messages from the second module to the first module, characterized in that the
network manages the outgoing messages in a way different from the return messages. 

5. Integrated circuit according to claim 3, wherein the network has a first mode 
wherein a message is transferred within a guaranteed time interval, and a second 
mode wherein a message is transferred as fast as possible with the available resources, 
wherein the outgoing transaction is a read message, requesting the second module to 
send data to the first module, wherein the return transaction is the data generated by 
the second module upon this request, and wherein the outgoing transaction is 
transferred according to the second mode, and the return transaction is transferred 
according to the first mode. 

6. Integrated circuit according to claim 3, wherein the network allows at least
two of the following transaction modes: unordered, locally ordered and globally
ordered, wherein an unordered transaction mode of the network gives no guarantees
for the order in which messages will arrive at their destination, a locally ordered
transaction mode guarantees that messages sent to the same destination will arrive in
the same order as they were sent, a global ordered transaction mode guarantees that
messages will arrive in the same order as they were sent even if they are sent to
different destinations.

7. Integrated circuit according to claim 3, wherein the network reserves a first
and a second buffer space for the first and the second module respectively, the
buffer spaces having a mutually different size.

8. Integrated circuit comprising a plurality of modules, which modules are
arranged to communicate to each other via a network, wherein the network is
arranged to distribute a message from a first module to two or more second modules
and wherein the second modules are arranged to generate an acknowledge message
indicating receipt of the message from the first module,

the network being arranged to generate a single return message to the first module, in
dependence of the acknowledge messages of the second modules.

9. Integrated circuit according to claim 8, wherein the single return message
indicates that at least one of the second modules has received the message issued by
the first module.

10. Integrated circuit according to claim 8, wherein the single return message
indicates that each of the second modules has received the message issued by the first
module.
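
The way the network may fold several acknowledge messages into a single
return message (claims 8 to 10) can be sketched as below. This is only an
illustration: the function name combine_acks, the policy strings and the
use of booleans for acknowledge messages are assumptions, not part of the
claims.

```python
from typing import Iterable


def combine_acks(acks: Iterable[bool], policy: str = "all") -> bool:
    """Reduce the acknowledges of the second modules to one return value.

    policy "any": at least one second module received the message (claim 9).
    policy "all": each second module received the message (claim 10).
    """
    received = list(acks)
    if policy == "any":
        return any(received)
    if policy == "all":
        # An empty set of acknowledges is treated as "not received".
        return len(received) > 0 and all(received)
    raise ValueError(f"unknown policy: {policy}")


acks = [True, False, True]            # three second modules, one did not acknowledge
one_got_it = combine_acks(acks, "any")
all_got_it = combine_acks(acks, "all")
```

The first module then sees one return message regardless of how many
second modules the message was distributed to.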

11. Integrated circuit comprising a first plurality of processing modules and a
network, the network comprising a second plurality of nodes and interconnections
between nodes, the network being arranged for transferring messages between a first
and a second module via a path through the network, the processing modules coupled
to the network via a network interface having a buffer for receiving incoming
messages, wherein a message from a first to a second module is not initiated until the
buffer has sufficient space for receiving a return message from the second module.
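
The condition in claim 11 can be sketched as a simple reservation check in
the network interface. The sketch assumes that the size of the expected
return message is known when the outgoing message is issued and that
buffer space is counted in words; the class and method names are
hypothetical.

```python
class NetworkInterface:
    """Tracks buffer space set aside for return messages still in flight."""

    def __init__(self, buffer_words: int) -> None:
        self.capacity = buffer_words
        self.reserved = 0

    def try_send(self, expected_return_words: int) -> bool:
        """Initiate a message only if its return message is sure to fit."""
        if self.reserved + expected_return_words > self.capacity:
            return False  # hold the message back until space is released
        self.reserved += expected_return_words
        return True

    def consume_return(self, words: int) -> None:
        """The module has read a return message; release its reserved space."""
        if words > self.reserved:
            raise ValueError("releasing more space than was reserved")
        self.reserved -= words


ni = NetworkInterface(buffer_words=8)
ni.try_send(6)        # accepted: 6 of 8 words reserved
ni.try_send(4)        # refused: only 2 words left
ni.consume_return(6)  # return message read by the module
ni.try_send(4)        # accepted now
```

Holding back the outgoing message in this way means a return message never
has to wait in the network for buffer space at its destination.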

Abstract

Networks are emerging as a possible solution for on-chip
interconnects. In this paper, we describe how networks
on chip (NoC) are similar to and differ from
both off-chip networks (e.g., computer networks) and current
on-chip interconnects (e.g., buses). We re-examine
the communication services in the context of NoCs. We
provide services that abstract from network implementations,
enabling a clean separation between the NoC and
IP blocks. We define a request-response transaction model
similar to bus protocols, making our approach backward
compatible. To exploit the full power of NoCs,
we also provide connection-oriented communication with
differentiated services. Examples are bandwidth guarantees,
transaction ordering, and end-to-end flow control.

Key Words: Networks on chip, on-chip buses, computer
networks, communication services, protocol stack,
transaction, connection.